20200203
From meeting with Tom Fitzgerald on 26 November 2019:
• Introgression: - Create giant population VCF - choose the datasets. Just a case of merging some VCFs. - Get some Indonesian medaka • LD decay: - LD plots - per chromosome. - Heatmap per chromosome? • Fst plot
Working directory here: /hps/research1/birney/users/ian/mikk_paper
# move to working directory
homehps
cd mikk_paper
# clone git repository
git clone https://github.com/Ian-Brettell/mikk_genome.git
# create directory for VCFs
mkdir vcfs
cp /nfs/research1/birney/projects/medaka/inbred_panel/medaka-alignments-release-94/vcf/medaka_inbred_panel_ensembl_new_reference_release_94.vcf* vcfs
mikk_genome/data/20200206_cram_id_to_line_id.txt
# Find duplicates
ssh ebi
homehps
cd mikk_paper/mikk_genome/
cat data/20200206_cram_id_to_line_id.txt | cut -f2 | cut -f1 -d"_" | sort | uniq -d
Note the following duplicates:
-106 -11 -117 -131 -132 -134 -135 -138 -14 -140 -141 -15 -23 -32 -39 -4 -40 -49 -59 -69 -7 -71 -72 -80 -84
Only take _1 sibling from pair, unless what is excluded is the only survivor based on mikk_behaviour/data/panel_1/20200109_panel_lines.txt.
Query whether we keep the lines that may have died out? Ask Felix.
mikk_genome/data/20200206_cram2line_key_no-sibs.txt
Excluded IDs: mikk_genome/data/20200206_excluded_lines.txt
20200225
Full list of MIKK lines from Felix here: mikk_genome/data/20200210_panel_lines_full.txt
cat ~/Documents/Repositories/mikk_genome/data/20200210_panel_lines_full.txt | cut -f1 -d"-" | sort | uniq -d
List with no sibling lines here: mikk_genome/data/20200227_panel_lines_no-sibs.txt. 64 lines total.
Excluded IDs here: mikk_genome/data/20200227_panel_lines_excluded.txt. 16 lines total.
Replace all dashes with underscores to match cram2line key file
sed 's/-/_/g' data/20200227_panel_lines_no-sibs.txt > data/20200227_panel_lines_no-sibs_us.txt
Extract the lines to keep from the key file.
awk 'FNR==NR {f1[$0]; next} $2 in f1' data/20200227_panel_lines_no-sibs_us.txt data/20200206_cram_id_to_line_id.txt > data/20200227_cram2line_no-sibs.txt
Has 66 lines instead of 63 (because we’re missing 130-2), so there must be duplicates. Find out which ones:
cat data/20200227_cram2line_no-sibs.txt | cut -f2 | cut -f1 -d"_" | sort | uniq -d
32 71 84
Manually removed (data/20200227_duplicates_excluded.txt):
• 24271_7#5 32_2 • 24271_8#4 71_1 • 24259_1#1 84_2
Final version: data/20200227_cram2line_no-sibs.txt
Final version, cram IDs only:
# create list of CRAM IDs in VCF
bcftools query -l vcfs/medaka_inbred_panel_ensembl_new_reference_release_94.vcf > tmp.txt
# confirm that it's in the same order as the column in the line IDs file
cut -f1 mikk_genome/data/20200206_cram_id_to_line_id.txt | tail -n+2 > tmp2.txt
# bash script to compare
file1="tmp.txt"
file2="tmp2.txt"
if cmp -s "$file1" "$file2"; then
printf 'The file "%s" is the same as "%s"\n' "$file1" "$file2"
else
printf 'The file "%s" is different from "%s"\n' "$file1" "$file2"
fi
# clean up
rm tmp*
# create file with no header
tail -n+2 mikk_genome/data/20200206_cram_id_to_line_id.txt > mikk_genome/data/20200203_cram2line_no-header.txt
# replace tab with space
sed 's/\t/ /g' mikk_genome/data/20200203_cram2line_no-header.txt > tmp.txt
mv tmp.txt mikk_genome/data/20200203_cram2line_no-header.txt
# Rename samples with BCFTOOLS
bcftools reheader --output-file vcfs/panel_line-ids.vcf --samples mikk_genome/data/20200203_cram2line_no-header.txt vcfs/medaka_inbred_panel_ensembl_new_reference_release_94.vcf
# test
bcftools query -l vcfs/panel_line-ids.vcf
#[E::bcf_hdr_add_sample] Duplicated sample name '84_2'
#[E::bcf_hdr_add_sample] Duplicated sample name '141_3'
#[E::bcf_hdr_add_sample] Duplicated sample name '32_2'
#[E::bcf_hdr_add_sample] Duplicated sample name '71_1'
#Failed to open vcfs/panel_line-ids.vcf: could not parse header
# create no-sibs file with CRAM ID only
cut -f1 mikk_genome/data/20200227_cram2line_no-sibs.txt > mikk_genome/data/20200227_cram2line_no-sibs_cram-only.txt
# pull out only samples to be included, then recode
bcftools view --output-file vcfs/panel_no-sibs.vcf --samples-file mikk_genome/data/20200227_cram2line_no-sibs_cram-only.txt vcfs/medaka_inbred_panel_ensembl_new_reference_release_94.vcf
# SUCCESS
# recode
bcftools reheader --output vcfs/panel_no-sibs_line-ids.vcf --samples mikk_genome/data/20200227_cram2line_no-sibs.txt vcfs/panel_no-sibs.vcf
# compress
## option 1: bgzip vcfs/panel_no-sibs_line-ids.vcf
## option 2:
bcftools view --output-type z --output-file vcfs/panel_no-sibs_line-ids.vcf.gz vcfs/panel_no-sibs_line-ids.vcf
# create index
bcftools index --tbi vcfs/panel_no-sibs_line-ids.vcf.gz
mkdir stats
bcftools stats vcfs/panel_no-sibs_line-ids.vcf.gz > stats/20200305_panel_no-sibs.txt
• number of samples: 63 • number of records: 29,161,024 • number of no-ALTs: 0 • number of SNPs: 24,031,673 • number of MNPs: 0 • number of indels: 5575994 • number of others: 449159 • number of multiallelic sites: 2957366 • number of multiallelic SNP sites: 1434908
• ts: 12640470 • tv: 11886484
• ts/tv: 1.06
mkdir refs
cp /nfs/research1/birney/projects/medaka/inbred_panel/medaka-alignments-release-94/ref/Oryzias_latipes.ASM223467v1* refs/
mkdir vcfs/split_by_chr
for i in $(seq 1 24); do
bsub -o log/split_by_chr_$i.out -e log/split_by_chr_$i.err \
"bcftools filter \
--regions $i \
--output-type z \
--output vcfs/split_by_chr/panel_no-sibs_chr-$i.vcf.gz \
vcfs/panel_no-sibs_line-ids.vcf.gz";
done
mkdir stats/by_chr
for i in $(seq 1 24); do
bsub -o log/stats_by_chr_$i.out -e log/stats_by_chr_$i.err \
"bcftools stats \
vcfs/split_by_chr/panel_no-sibs_chr-$i.vcf.gz > stats/by_chr/$i.txt";
done
mkdir ld
mkdir ld/20200305_panel_maf-0.03_window-50kb
for i in $(seq 1 24); do
bsub -M 30000 -n 8 -o log/ld_$i.out -e log/ld_$i.err \
"vcftools \
--gzvcf vcfs/split_by_chr/panel_no-sibs_chr-$i.vcf.gz \
--geno-r2 \
--ld-window-bp 50000 \
--maf 0.03 \
--out ld/20200305_panel_maf-0.03_window-50kb/$i";
done
# creates huge files (~10G). Take random 1M lines
## create set up directory
mkdir ld/20200305_panel_maf-0.03_window-50kb_thinned
##
for i in $(seq 1 24); do
bsub -o log/ld_trim_$i.out -e log/ld_trim_$i.err \
"head -1 > ld/20200305_panel_maf-0.03_window-50kb_thinned/$i\_1m.txt && \
shuf -n 1000000 ld/20200305_panel_maf-0.03_window-50kb/$i.geno.ld >> ld/20200305_panel_maf-0.03_window-50kb_thinned/$i\_1m.txt";
done
head -1 > ld/20200305_panel_maf-0.03_window-50kb_thinned/24_1m.txt
shuf -n 1000000 ld/20200305_panel_maf-0.03_window-50kb/24.geno.ld >> ld/20200305_panel_maf-0.03_window-50kb_thinned/24_1m.txt
#Try with plink and --thin argument
mkdir ld/20200305_panel_maf-0.03_window-50kb_plink
/nfs/software/birney/plink \
--vcf vcfs/split_by_chr/panel_no-sibs_chr-24.vcf.gz \
--double-id \
--thin 0.025 \
--snps-only \
--geno 0.3 \
--r2 \
--ld-window-r2 0 \
--ld-window 999999 \
--ld-window-kb 50000 \
--chr-set 24 \
--out ld/20200305_panel_maf-0.03_window-50kb_plink/test
# works
# TRUE
for i in $(seq 24 24); do
bsub -M 30000 -o log/ld_plink_$i.out -e log/ld_plink_$i.err \
"/nfs/software/birney/plink \
--vcf vcfs/split_by_chr/panel_no-sibs_chr-$i.vcf.gz \
--double-id \
--thin 0.025 \
--snps-only \
--geno 0.3 \
--r2 \
--ld-window-r2 0 \
--ld-window 999999 \
--ld-window-kb 50000 \
--chr-set 24 \
--out ld/20200305_panel_maf-0.03_window-50kb_plink/$i";
done
–double-id removes the issue of the line IDs having underscores in them. –thin 0.025 takes only 0.025 of the variants –snps-only takes only SNPs –geno 0.3 filters out variants with missing call rates above 0.3 –r2 gets the R^2 –ld-window-r2 0 includes all pairs of variants, including those with R^2 less than 0.2 –ld-window 999999 sets the maximum number of variants allowed between a pair of variants –ld-window-kb 50000 sets the distance of the comparison window –chr-set 24 tells plink that the chromosome number is 24 so that it doesn’t get confused that they’re not human.
for i in $(seq 1 24); do
bsub -M 20000 -n 4 -o log/missing_$i.out -e log/missing_$i.err \
"vcftools \
--gzvcf vcfs/split_by_chr/panel_no-sibs_chr-$i.vcf.gz \
--missing-site \
--out missing/$i";
done
# Find duplicates
cut -f2 mikk_genome/data/20200206_cram_id_to_line_id.txt | sort | uniq -cd
• 2 141_3 • 2 32_2 • 2 71_1 • 2 84_2
Edited them as follows:
• 2 141_3-2 • 2 32_2-2 • 2 71_1-2 • 2 84_2-2
Saved here: mikk_genome/data/20200305_cram2line_full_dupes-edited.txt
# rehead
bcftools reheader \
--output vcfs/full-run_line-ids.vcf \
--samples mikk_genome/data/20200305_cram2line_full_dupes-edited.txt \
vcfs/medaka_inbred_panel_ensembl_new_reference_release_94.vcf
# compress
bcftools view \
--output-type z \
--output-file vcfs/full-run_line-ids.vcf.gz \
vcfs/full-run_line-ids.vcf
# index
bcftools index \
--tbi vcfs/full-run_line-ids.vcf.gz
20200602
data_files <- list.files("~/Documents/Data/20200305_panel_maf-0.03_window-50kb_thinned",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200305_panel_maf-0.03_window-50kb_thinned")
data_files_trunc <- gsub("_1m.txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = F)
names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
data_list <- lapply(data_list, function(chr){
chr$distance_kb <- abs(chr$snp_1 - chr$snp_2) / 1000
return(chr)
})
data_df <- dplyr::bind_rows(data_list)
# TEST
test_df <- dplyr::sample_n(data_df, 10000)
test_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
# TRUE
data_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
ggsave(filename = paste("20200602_ld_decay_subset_allchr", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 15,
height = 15,
units = "cm",
dpi = 500)
# Distribution of counts in whole dataset
hist(data_df$count)
# how many?
length(which(data_df$count == 63))
# TRUE
data_df %>%
dplyr::filter(count == 63) %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
ggsave(filename = paste("20200603_ld_decay_subset_allchr_allsmpls", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 20,
units = "cm",
dpi = 500)
full_call_df <- data_df %>%
dplyr::filter(count == 63 & r2 == 1 & distance_kb > 20)
full_call_df_cnt <- full_call_df %>%
dplyr::count(chr)
# Plot
full_call_df_cnt %>%
ggplot(aes(chr, n, fill = factor(chr))) +
geom_col() +
theme_bw() +
guides(fill = F)
View(full_call_df %>% dplyr::filter(distance > 20))
Looking at individual SNPs in the VCF using grep, most of them have low MAFs (3 < x < 5). Run LD calcs again using threshold of 5%.
mkdir ld/20200305_panel_maf-0.05_window-50kb
for i in $(seq 1 24); do
bsub -M 30000 -n 8 -o log/ld_$i.out -e log/ld_$i.err \
"vcftools \
--gzvcf vcfs/split_by_chr/panel_no-sibs_chr-$i.vcf.gz \
--geno-r2 \
--ld-window-bp 50000 \
--maf 0.05 \
--out ld/20200305_panel_maf-0.05_window-50kb/$i";
done
mkdir ld/20200305_panel_maf-0.03_window-50kb_full-calls
for i in $(find ld/20200305_panel_maf-0.03_window-50kb/23.geno.ld); do
name=$(basename $i | cut -f1 -d".");
awk_subscript=$(echo "awk '\$4 == 63' $i >> ld/20200305_panel_maf-0.03_window-50kb_full-calls/$name.txt");
script=$(echo "head -1 $i > ld/20200305_panel_maf-0.03_window-50kb_full-calls/$name.txt; $awk_subscript");
bsub -M 30000 -o log/ld_full-calls_$name.out -e log/ld_full-calls_$name.err $(printf $script);
done
# Can't do it like this - has a problem with the awk script. Have to do it in one job:
for i in $(find ld/20200305_panel_maf-0.03_window-50kb/*); do
name=$(basename $i | cut -f1 -d".");
head -1 $i > ld/20200305_panel_maf-0.03_window-50kb_full-calls/$name.txt;
awk '$4 == 63' $i >> ld/20200305_panel_maf-0.03_window-50kb_full-calls/$name.txt
done
R script here: mikk_genome/code/scripts/20200602_ld_decay_plot.R
for i in $(find ld/20200305_panel_maf-0.03_window-50kb_full-calls/* | head -1); do
date_today=$(date +'%Y%m%d');
name=$(basename $i | cut -f1 -d".");
bsub -M 50000 -o log/$date_today\_$name\_plotld.out -e log/$date_today\_$name\_plotld.err "Rscript --vanilla mikk_genome/code/scripts/20200602_ld_decay_plot.R $i plots/ $date_today";
done
# Can't handle it with 50MB memory
for i in $(find ld/20200305_panel_maf-0.05_window-50kb_full-calls/* | head -1); do
date_today=$(date +'%Y%m%d');
name=$(basename $i | cut -f1 -d".");
bsub -M 50000 -o log/$date_today\_plotld_$name.out -e log/$date_today\_plotld_$name.err "Rscript --vanilla mikk_genome/code/scripts/20200602_ld_decay_plot.R $i plots/ $date_today";
done
# STILL has the line at R^2 == 1!!
20200707
Try all again, this time removing indels.
mkdir ld/20200707_panel_maf-0.05_window-50kb
VCFToolsfor i in $(find vcfs/split_by_chr/*); do
date_today=$(date +'%Y%m%d');
chr=$(basename $i | awk -F'-' '{print $3}' | sed 's/.vcf.gz//');
bsub -M 30000 -o log/$date_today\_ld_$chr.out -e log/$date_today\_ld_$chr.err \
"vcftools \
--gzvcf $i \
--ld-window-bp 50000 \
--maf 0.05 \
--max-alleles 2 \
--min-alleles 2 \
--remove-indels \
--geno-r2 \
--out ld/20200707_panel_maf-0.05_window-50kb/$chr";
done
# needs --recode flag!!
vcftools \
--gzvcf vcfs/panel_no-sibs_line-ids.vcf.gz \
--max-missing 1 \
--recode \
--stdout > vcfs/panel_no-sibs_line-ids_no-missing.vcf
## compress
bcftools view --output-type z --output-file vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz vcfs/panel_no-sibs_line-ids_no-missing.vcf
# create index
bcftools index --tbi vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz
• number of samples: 63 • number of records: 20,086,433 • number of no-ALTs: 0 • number of SNPs: 16,956,405 • number of MNPs: 0 • number of indels: 3329681 • number of others: 0 • number of multiallelic sites: 1333182 • number of multiallelic SNP sites: 260770
# make directory
mkdir ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99/
# run
for i in $(seq 1 24); do
date_today=$(date +'%Y%m%d');
bsub -M 30000 -o log/$date_today\_ld_maf-max_$chr.out -e log/$date_today\_ld_maf-max_$chr.err \
"vcftools \
--gzvcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--ld-window-bp 50000 \
--chr $i \
--maf 0.05 \
--max-maf 0.99 \
--max-alleles 2 \
--min-alleles 2 \
--remove-indels \
--geno-r2 \
--out ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99/$i";
done
## create set up directory
mkdir ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99_thinned/
## Extract random 1M pairs
for i in $(seq 1 24); do
date_today=$(date +'%Y%m%d');
bsub -o log/$date_today\_ld_trim_$i.out -e log/$date_today\_ld_trim_$i.err \
"head -1 ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99/$i.geno.ld > ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99_thinned/$i\_1m.txt && \
shuf -n 1000000 ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99/$i.geno.ld >> ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99_thinned/$i\_1m.txt";
done
## send to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99_thinned/ ~/Documents/Data/20200707_mikk_ld
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99_thinned",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99_thinned")
data_files_trunc <- gsub("_1m.txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
data_list <- lapply(data_list, function(chr){
chr$distance_kb <- abs(chr$snp_1 - chr$snp_2) / 1000
return(chr)
})
data_df <- dplyr::bind_rows(data_list)
# TEST
test_df <- dplyr::sample_n(data_df, 10000)
test_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
# TRUE
data_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
ggsave(filename = paste("20200708_ld_decay_subset_allchr_max_maf_0.99", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 15,
height = 15,
units = "cm",
dpi = 500)
Looks better - with reduced density up the top for some chromosomes - but for others there are still strong lines with an r^2 of 1.
Try setting the max MAF at 0.5.
# make directory
mkdir ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50/
# run
for i in $(seq 1 24); do
date_today=$(date +'%Y%m%d');
bsub -M 30000 -o log/$date_today\_ld_maf-max_$chr.out -e log/$date_today\_ld_maf-max_$chr.err \
"vcftools \
--gzvcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--ld-window-bp 50000 \
--chr $i \
--maf 0.05 \
--max-maf 0.50 \
--max-alleles 2 \
--min-alleles 2 \
--remove-indels \
--geno-r2 \
--out ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50/$i";
done
## create set up directory
mkdir ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50_thinned/
## Extract random 1M pairs
for i in $(seq 1 24); do
date_today=$(date +'%Y%m%d');
bsub -o log/$date_today\_ld_trim_$i.out -e log/$date_today\_ld_trim_$i.err \
"head -1 ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50/$i.geno.ld > ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50_thinned/$i\_1m.txt && \
shuf -n 1000000 ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50/$i.geno.ld >> ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50_thinned/$i\_1m.txt";
done
## send to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50_thinned/ ~/Documents/Data/20200707_mikk_ld
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50_thinned",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50_thinned")
data_files_trunc <- gsub("_1m.txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
data_list <- lapply(data_list, function(chr){
chr$distance_kb <- abs(chr$snp_1 - chr$snp_2) / 1000
return(chr)
})
data_df <- dplyr::bind_rows(data_list)
# TRUE
data_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
ggsave(filename = paste("20200709_ld_decay_subset_allchr_max-maf-0.50", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 15,
height = 15,
units = "cm",
dpi = 500)
# make directory
mkdir ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/
# run
for i in $(seq 1 24); do
date_today=$(date +'%Y%m%d');
bsub -M 30000 -o log/$date_today\_ld_maf-max_$chr.out -e log/$date_today\_ld_maf-max_$chr.err \
"vcftools \
--gzvcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--ld-window-bp 50000 \
--chr $i \
--maf 0.10 \
--max-maf 0.90 \
--max-alleles 2 \
--min-alleles 2 \
--remove-indels \
--geno-r2 \
--out ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/$i";
done
## create set up directory
mkdir ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/
## Extract random 1M pairs
for i in $(seq 1 24); do
date_today=$(date +'%Y%m%d');
bsub -o log/$date_today\_ld_trim_$i.out -e log/$date_today\_ld_trim_$i.err \
"head -1 ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/$i.geno.ld > ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/$i\_1m.txt && \
shuf -n 1000000 ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/$i.geno.ld >> ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/$i\_1m.txt";
done
## send to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/ ~/Documents/Data/20200707_mikk_ld
### NOTE: weird double header in chr14 file. Remove
head -1 ~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/14_1m.txt > ~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/14_1m.txt.tmp
grep -F -v "R^2" ~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/14_1m.txt >> ~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/14_1m.txt.tmp && mv ~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/14_1m.txt.tmp ~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/14_1m.txt
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned")
data_files_trunc <- gsub("_1m.txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
# df$snp_1 <- as.integer(df$snp_1)
# df$snp_2 <- as.integer(df$snp_2)
# df$chr <- as.integer(df$chr)
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
data_list <- lapply(data_list, function(chr){
chr$distance_kb <- abs(chr$snp_1 - chr$snp_2) / 1000
# chr$chr <- as.integer(chr$chr)
return(chr)
})
data_df <- dplyr::bind_rows(data_list)
# TEST
test_df <- dplyr::sample_n(data_df, 10000)
test_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
# TRUE
data_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
ggsave(filename = paste("20200715_ld_decay_subset_allchr_min-maf-0.10_max-maf-0.90", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 15,
height = 15,
units = "cm",
dpi = 500)
# No missing
bcftools +fill-tags \
vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--output-type z \
--output vcfs/panel_no-sibs_line-ids_no-missing_with-maf.vcf.gz \
-- \
--tags MAF
## Count of variants: 20,086,433
# No missing, biallelic SNPs only
bcftools view \
--min-alleles 2 \
--max-alleles 2\
--types snps \
--output-type z \
--output vcfs/panel_no-sibs_line-ids_no-missing_with-maf_bi-snps.vcf.gz \
vcfs/panel_no-sibs_line-ids_no-missing_with-maf.vcf.gz
## Count of variants: 16,035,052
# make directory
mkdir maf
# get stats
bcftools query \
--format '%CHROM\t%POS\t%INFO/MAF\n' \
--output maf/20200707_maf.txt \
vcfs/panel_no-sibs_line-ids_no-missing_with-maf_bi-snps.vcf.gz
# send to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/maf/20200707_maf.txt ~/Documents/Data/20200707_mikk_ld
maf_dat <- read.table("~/Documents/Data/20200707_mikk_ld/20200707_maf.txt",
header = F, sep = "\t")
colnames(maf_dat) <- c("chr", "pos", "maf")
maf_dat %>%
ggplot() +
geom_histogram(aes(x = maf,
y=..count../1000000),
bins = 40,
fill = "#2A9D8F",
colour = "#264653") +
theme_bw() +
guides(fill = F) +
xlab("Minor allele frequencies") +
ylab("Count (in millions of sites)")
ggsave(filename = paste("20200707_maf_freqs", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
NOTE Lessons from the start of the PhD:
• Need to first recode SNPs into the 1234 format. • Then recode that into Haploview format. • Replace asterisks with 0 in PED file
# make directory for outputs
mkdir plink
# run over VCF with no missing sites
plink \
--vcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--maf 0.05 \
--make-bed \
--double-id \
--snps-only \
--biallelic-only \
--chr 1-24 \
--allow-extra-chr \
--allele1234 \
--out plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234
# recode for HV
plink \
--bfile plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234 \
--recode HV \
--maf 0.05 \
--double-id \
--snps-only \
--biallelic-only \
--chr 1-24 \
--allow-extra-chr \
--out plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234_hv
# replace * with 0
for i in $(find plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234_hv*.ped); do sed -i 's/\*/0/g' $i; done
# send to local to play with
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234_hv.chr* ~/Documents/Data/20200707_mikk_ld/20200707_plink
# Too many comparisons makes Haploview crash. Try thinning.
## First send bfiles to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234.* ~/Documents/Data/20200707_mikk_ld/20200709_plink
# recode for HV (on local)
plink \
--bfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234 \
--recode HV \
--maf 0.05 \
--double-id \
--snps-only \
--biallelic-only \
--chr 1-24 \
--allow-extra-chr \
--thin 0.05 \
--out ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.05/thinned-0.05
# replace * with 0
for i in $(find /Users/brettell/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.05/thinned-0.05*.ped); do sed -i '.bak' 's/\*/0/g' $i; done
# run Haploview
java -Xmx20G -jar ~/Documents/Software/Haploview.jar \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.05/thinned-0.05.chr-24.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.05/thinned-0.05.chr-24.info \
-maxDistance 1000
# still crashes. Try with 0.025
mkdir ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025
plink \
--bfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234 \
--recode HV \
--maf 0.05 \
--double-id \
--snps-only \
--biallelic-only \
--chr 1-24 \
--allow-extra-chr \
--thin 0.025 \
--out ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025
# replace * with 0
for i in $(find /Users/brettell/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025*.ped); do sed -i '.bak' 's/\*/0/g' $i; done
# run Haploview
java -Xmx20G -jar ~/Documents/Software/Haploview.jar \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-24.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-24.info \
-maxDistance 1000
# Creates plot, but crashes when saving it.
gaston packagemkdir plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05
# make BED
plink \
--vcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--maf 0.05 \
--make-bed \
--double-id \
--snps-only \
--biallelic-only \
--chr 1-24 \
--allow-extra-chr \
--out plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05
# recode for 012
plink \
--bfile plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05 \
--recode A \
--chr 1-24 \
--allow-extra-chr \
--out plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05_recode-012
# recode for 012 transposed
plink \
--bfile plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05 \
--recode A-transpose \
--chr 1-24 \
--allow-extra-chr \
--out plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05_recode-012
# send to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05_recode-012.raw ~/Documents/Data/20200707_mikk_ld/20200714_plink
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05_recode-012.traw ~/Documents/Data/20200707_mikk_ld/20200714_plink
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05.bim ~/Documents/Data/20200707_mikk_ld/20200714_plink
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05.fam ~/Documents/Data/20200707_mikk_ld/20200714_plink
library(gaston)
library(tidyverse)
mikk <- read.table("~/Documents/Data/20200707_mikk_ld/20200714_plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05_recode-012.traw",
header = T)
# BIM
mikk.bim <- read.table("~/Documents/Data/20200707_mikk_ld/20200714_plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05.bim",
header = F,
col.names = c("chr", "id", "dist", "pos", "A1", "A2"))
# FAM
mikk.fam <- read.table("~/Documents/Data/20200707_mikk_ld/20200714_plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05.fam",
header = F,
col.names = c("famid", "id", "father", "mother", "sex", "pheno"))
# rename sample IDs
mikk.gen <- mikk
colnames(mikk.gen)[7:ncol(mikk.gen)] <- mikk.fam$id
# remove unneeded columns
mikk.gen <- mikk.gen[, c(1, 7:ncol(mikk.gen))]
# split by chromosome
mikk.gen_lst <- split(mikk.gen, f = mikk.gen$CHR)
# remove CHR column
mikk.gen_lst <- lapply(mikk.gen_lst, function(chr){
chr$CHR <- NULL
return(chr)
})
# split BIM by chr as well
## get counts
#mikk.bim %>% group_by(chr) %>% count
## split by chr
mikk.bim_lst <- split(mikk.bim, f = mikk.bim$chr)
set.seed(45)
targ_ind <- sort(sample(nrow(mikk.gen_lst$`8`), 1000))
# pull out 50 SNPs
mikk.gen_8 <- mikk.gen_lst$`8`[targ_ind, ]
mikk.gen_8 <- t(as.matrix(mikk.gen_8))
mikk.bim_8 <- mikk.bim_lst$`8`[targ_ind, ]
# create bed matrix
x <- as.bed.matrix(mikk.gen_8, bim = mikk.bim_8)
# compute LD
ld.x <- gaston::LD(x, c(1,ncol(x)))
# replace NaNs with 0
ld.x[which(is.na(ld.x))] <- 0
# plot
LD.plot( ld.x,
snp.positions = x@snps$pos,
max.dist = 1000000,
write.ld = NULL,
write.snp.id = F,
pdf.file = "~/Documents/Docs/medaka pics/20200602_mikk_genome/20200714_test.pdf")
WORKS!
Run on all chrs
# get vector of seeds
set.seed(5)
seeds <- sample(seq(1, 100), 24)
# run over list
counter <- 0
lapply(mikk.gen_lst, function(chr){
counter <<- counter + 1
# get seed
set.seed(seeds[counter])
targ_ind <- sort(sample(nrow(mikk.gen_lst[[counter]]), 1000))
# pull out 50 SNPs
mikk.gen <- mikk.gen_lst[[counter]][targ_ind, ]
mikk.gen <- t(as.matrix(mikk.gen))
mikk.bim <- mikk.bim_lst[[counter]][targ_ind, ]
# create bed matrix
x <- as.bed.matrix(mikk.gen, bim = mikk.bim)
# compute LD
ld.x <- gaston::LD(x, c(1,ncol(x)))
# replace NaNs with 0
ld.x[which(is.na(ld.x))] <- 0
# plot
LD.plot(ld.x,
snp.positions = x@snps$pos,
max.dist = 1000000,
write.ld = NULL,
write.snp.id = F,
pdf.file = paste("~/Documents/Docs/medaka pics/20200602_mikk_genome/",
gsub("-", "", Sys.Date()),
"_chr",
counter,
".pdf",
sep = ""))
})
set.seed(5)
seeds <- sample(seq(1, 100), 24)
counter <- 0
mikk_bed_lst <- lapply(mikk.gen_lst, function(chr){
counter <<- counter + 1
# turn chr into list
chr <- list()
# get bed matrix
## set up GEN and BIM files
mikk.gen <- mikk.gen_lst[[counter]]
mikk.gen <- t(as.matrix(mikk.gen))
mikk.bim <- mikk.bim_lst[[counter]]
## form BED matrix
x <- gaston::as.bed.matrix(mikk.gen, bim = mikk.bim)
chr[["bed_mat"]] <- x
# get LD matrix
## get seed
set.seed(seeds[counter])
targ_ind <- sort(sample(nrow(mikk.gen_lst[[counter]]), 1000))
chr[["target_snps_indexes"]] <- targ_ind
## pull out select SNPs
mikk.gen <- mikk.gen_lst[[counter]][targ_ind, ]
mikk.gen <- t(as.matrix(mikk.gen))
mikk.bim <- mikk.bim_lst[[counter]][targ_ind, ]
# create bed matrix
x <- as.bed.matrix(mikk.gen, bim = mikk.bim)
# compute LD
ld.x <- gaston::LD(x, c(1,ncol(x)))
chr[["LD"]] <- ld.x
return(chr)
# # replace NaNs with 0
# ld.x[which(is.na(ld.x))] <- 0
})
all_snps <- lapply(mikk_bed_lst, function(x){
x <- x$bed_mat@snps
return(x)
})
all_snps <- dplyr::bind_rows(all_snps)
# 444 have missing values:
length(which(is.na(all_snps$maf)))
# remove
all_snps <- all_snps[which(!is.na(all_snps$maf)), ]
# which ones have N2 == 63?
View(all_snps[which(all_snps$N2 == 63), ])
unique(all_snps$chr[which(all_snps$N2 == 63)])
# just in Chr 24. Check out distribution
hist(all_snps$N2[all_snps$chr == 24])
# how many have a MAF < 0.05?
length(which(all_snps$maf < 0.05))
# Nearly 10%!
# Split back and get indexes so we can remove them before running the LD plots again
all_snps_lst <- split(all_snps, f = all_snps$chr)
snps_to_remove <- lapply(all_snps_lst, function(x){
which(x$maf < 0.05)
})
# Only in chrs 23 and 24.
# This looks pretty weird too...
View(all_snps[all_snps$maf < 0.05, ])
# Try calculating maf manually to see if it's correct
all_snps$maf_manual <- ifelse(all_snps$N1 == 63,
0.5,
ifelse(all_snps$N0 > all_snps$N2,
((all_snps$N2*2) + all_snps$N1) / 126,
((all_snps$N0*2) + all_snps$N1) / 126))
# how many are different?
length(which(all_snps$maf_manual != all_snps$maf))
View(all_snps[which(all_snps$maf_manual != all_snps$maf), ])
# just check that all add up to 63
which(all_snps$N0 + all_snps$N1 + all_snps$N2 != 63)
# Yep.
# any still under 0.05?
which(all_snps$maf_manual < 0.05)
# dammit. Which ones?
unique(all_snps$chr[which(all_snps$maf_manual < 0.05)])
# Just in chr 24. Weird. Must be caused by Plink not handling 24 autosomes.
all_snps %>%
ggplot() +
geom_histogram(aes(x = maf_manual,
y=..count../1000000),
bins = 40,
fill = "#f4a261",
colour = "#e76f51") +
theme_bw() +
guides(fill = F) +
xlab("Minor allele frequencies") +
ylab("Count (in millions of sites)")
# get data frame of matrix indices with R^2 of 1
test <- data.frame(which(mikk_bed_lst[["1"]][["LD"]] == 1, arr.ind = T))
# remove diagonals
test <- test[which(test$row != test$col), ]
# get indices for full snp list
snp_1_ind <- mikk_bed_lst[["1"]][["target_snps_indexes"]][test$row]
snp_2_ind <- mikk_bed_lst[["1"]][["target_snps_indexes"]][test$col]
# get distances between those snps
high_ld_df <- data.frame(snp_1_pos = mikk_bed_lst[["1"]]$bed_mat@snps$pos[snp_1_ind],
snp_2_pos = mikk_bed_lst[["1"]]$bed_mat@snps$pos[snp_2_ind],
snp_1_maf = mikk_bed_lst[["1"]]$bed_mat@snps$maf[snp_1_ind],
snp_2_maf = mikk_bed_lst[["1"]]$bed_mat@snps$maf[snp_2_ind])
high_ld_df$distance_kb <- abs((high_ld_df$snp_2_pos - high_ld_df$snp_1_pos)/1e6)
high_ld_lst <- lapply(mikk_bed_lst, function(x){
# get data frame of matrix indices with R^2 of 1
test <- data.frame(which(x[["LD"]] > 0.9, arr.ind = T))
# remove diagonals
test <- test[which(test$row != test$col), ]
# get count
x[["high_ld_count"]] <- test
# get indices for full snp list
snp_1_ind <- mikk_bed_lst[["1"]][["target_snps_indexes"]][test$row]
snp_2_ind <- mikk_bed_lst[["1"]][["target_snps_indexes"]][test$col]
# get distances between those snps
high_ld_df <- data.frame(snp_1_pos = mikk_bed_lst[["1"]]$bed_mat@snps$pos[snp_1_ind],
snp_2_pos = mikk_bed_lst[["1"]]$bed_mat@snps$pos[snp_2_ind],
snp_1_maf = mikk_bed_lst[["1"]]$bed_mat@snps$maf[snp_1_ind],
snp_2_maf = mikk_bed_lst[["1"]]$bed_mat@snps$maf[snp_2_ind])
high_ld_df$distance_kb <- abs((high_ld_df$snp_2_pos - high_ld_df$snp_1_pos)/1e6)
x[["high_ld_df"]] <- high_ld_df
return(x)
})
test <- data.frame(high_ld_lst$`17`$LD)
colnames(test) <- high_ld_lst$`17`$bed_mat@snps$pos[high_ld_lst$`17`$target_snps_indexes]
test$snp_1_pos <- high_ld_lst$`17`$bed_mat@snps$pos[high_ld_lst$`17`$target_snps_indexes]
test2 <- pivot_longer(test,
cols = -snp_1_pos,
names_to = "snp_2_pos",
values_to = "r2")
test2$snp_1_pos <- as.integer(test2$snp_1_pos)
test2$snp_2_pos <- as.integer(test2$snp_2_pos)
test2$distance_kb <- abs((test2$snp_1_pos - test2$snp_2_pos)/1000000)
# remove diagonals
test2 <- test2[test2$distance_kb != 0, ]
counter <- 0
ld_df_lst <- lapply(high_ld_lst, function(x){
counter <<- counter + 1
test <- data.frame(x$LD)
colnames(test) <- x$bed_mat@snps$pos[x$target_snps_indexes]
test$snp_1_pos <- x$bed_mat@snps$pos[x$target_snps_indexes]
test2 <- pivot_longer(test,
cols = -snp_1_pos,
names_to = "snp_2_pos",
values_to = "r2")
test2$snp_1_pos <- as.integer(test2$snp_1_pos)
test2$snp_2_pos <- as.integer(test2$snp_2_pos)
test2$distance_kb <- abs((test2$snp_1_pos - test2$snp_2_pos)/1000)
# remove diagonals
test2 <- test2[test2$distance_kb != 0, ]
# get chr
test2$chr <- factor(names(high_ld_lst)[counter], levels = seq(1, 24))
# put in order
test2 <- test2 %>% dplyr::select(chr, snp_1_pos, snp_2_pos, distance_kb, r2)
# assign
x[["r2_df"]] <- test2
return(x)
})
# Extract into data frame
ld_df <- lapply(ld_df_lst, function(x){
x <- x$r2_df
return(x)
})
ld_df <- dplyr::bind_rows(ld_df)
# remove comparisons beyond 50kb
ld_df_50kb <- ld_df[ld_df$distance_kb < 50, ]
# Plot
ld_df_50kb %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.5) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
# get vector of seeds
set.seed(5)
seeds <- sample(seq(1, 100), 24)
# run over list
counter <- 0
ld_plot_lst <- lapply(mikk.gen_lst, function(chr){
chr <- list()
counter <<- counter + 1
# create BED MAT from full files
mikk.gen <- mikk.gen_lst[[counter]]
mikk.gen <- t(as.matrix(mikk.gen))
mikk.bim <- mikk.bim_lst[[counter]]
x <- as.bed.matrix(mikk.gen, bim = mikk.bim)
chr[["bed_mat_full"]] <- x
# get indexes to keep
filt_ind <- which(x@snps$maf > 0.10)
# filter from original files and make BED again
mikk.gen_filt <- mikk.gen_lst[[counter]][filt_ind, ]
mikk.gen_filt <- t(as.matrix(mikk.gen_filt))
mikk.bim_filt <- mikk.bim_lst[[counter]][filt_ind, ]
x <- as.bed.matrix(mikk.gen_filt, bim = mikk.bim_filt)
chr[["bed_mat_filt"]] <- x
# get seed
set.seed(seeds[counter])
# made GEN file again
mikk.gen_filt <- mikk.gen_lst[[counter]][filt_ind, ]
# get sample indices
targ_ind <- sort(sample(nrow(mikk.gen_filt), 1000))
# pull out sample SNPs
mikk.gen_samp <- mikk.gen_filt[targ_ind, ]
mikk.gen_samp <- t(as.matrix(mikk.gen_samp))
mikk.bim_samp <- mikk.bim_filt[targ_ind, ]
# create bed matrix
x <- as.bed.matrix(mikk.gen_samp, bim = mikk.bim_samp)
chr[["bed_mat_samp"]] <- x
# compute LD
ld.x <- gaston::LD(x, c(1,ncol(x)))
# replace NaNs with 0
ld.x[which(is.na(ld.x))] <- 0
# plot
# LD.plot(ld.x,
# snp.positions = x@snps$pos,
# max.dist = 1000000,
# write.ld = NULL,
# write.snp.id = F,
# pdf.file = paste("~/Documents/Docs/medaka pics/20200602_mikk_genome/20200715_ld_maf-0.10/",
# "20200715_chr",
# counter,
# ".pdf",
# sep = ""))
return(chr)
})
file_in <- "~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/10_1m.txt"
# Read in data
data <- read.delim(file_in,
sep = "\t",
header = F,
skip = 1)
# Set column names
colnames(data) <- c("chr", "snp_1", "snp_2", "count", "r2")
# Create distance variables
data$distance <- abs(data$snp_1 - data$snp_2)
data$distance_kb <- abs(data$snp_1 - data$snp_2) / 1000
# Create bins
data$bin_dist_kb <- ggplot2::cut_interval(data$distance_kb, n = 50, labels = F)
data$bin_dist <- ggplot2::cut_interval(data$distance, n = 500)
# Get boundaries
data$bin <- ggplot2::cut_interval(data$distance, n = 500)
# Extract first boundaries
data$bin_bdr <- as.numeric(stringr::str_split(data$bin, ",", simplify = T)[,1 ] %>%
stringr::str_replace_all("\\(", "") %>%
stringr::str_replace_all("\\[", ""))
# Group by bin and get means
test <- data %>%
dplyr::group_by(bin_bdr) %>%
dplyr::summarise(mean = mean(r2, na.rm = T))
# Plot
ggplot(test) +
geom_line(aes(bin_bdr, mean))
Script here: code/scripts/20200715_r2_decay_mean.R
mkdir ld/20200715_mean_r2
# TEST
Rscript --vanilla mikk_genome/code/scripts/20200715_r2_decay_mean.R \
ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/18.geno.ld \
ld/20200715_mean_r2
# WORKS
# TRUE
for i in $(find ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/*); do
name=$(basename $i | cut -f1 -d".");
bsub -M 50000 -n 4 -o log/20200715_$name\_mean-r2.out -e log/20200715_$name\_mean-r2.err "Rscript --vanilla mikk_genome/code/scripts/20200715_r2_decay_mean.R $i ld/20200715_mean_r2";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200715_mean_r2/ ~/Documents/Data/20200707_mikk_ld/
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2/")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df <- dplyr::bind_rows(data_list, .id = "chr")
r2_df$chr <- factor(r2_df$chr, levels = seq(1, 24))
# get kb measure
r2_df$bin_bdr_kb <- r2_df$bin_bdr / 1000
r2_df %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
labs(colour = "Chromosome")
ggsave(filename = paste("20200715_mean-r2", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
r2_df %>% ggplot() +
geom_line(aes(bin_bdr_kb, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (Kb)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~chr, nrow = 6, ncol = 4) +
guides(colour = F)
ggsave(filename = paste("20200715_mean-r2_facet", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
20200723 ### Do again, but shrinking the y-axis labels slighlty to avoid overlap
r2_df %>% ggplot() +
geom_line(aes(bin_bdr_kb, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (Kb)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~chr, nrow = 6, ncol = 4) +
guides(colour = F) +
theme(axis.text = element_text(size = 8),
strip.text = element_text(size = 8),
panel.grid = element_blank())
ggsave(filename = paste("20200723_mean-r2_facet_50kb", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 20,
units = "cm",
dpi = 500)
end 20200723
mkdir ld/20200715_mean_r2_1kb-max
# TRUE
for i in $(find ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/*); do
name=$(basename $i | cut -f1 -d".");
bsub -M 30000 -n 4 -o log/20200715_$name\_mean-r2_1kb-max.out -e log/20200715_$name\_mean-r2_1kb-max.err "Rscript --vanilla mikk_genome/code/scripts/20200715_r2_decay_mean_1kb-lim.R $i ld/20200715_mean_r2_1kb-max";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200715_mean_r2_1kb-max/ ~/Documents/Data/20200707_mikk_ld/
vcftools \
--gzvcf vcfs/panel_no-sibs_line-ids.vcf.gz \
--het \
--out het/20200602
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2_1kb-max/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2_1kb-max/")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb$chr <- factor(r2_df_1kb$chr, levels = seq(1, 24))
# get kb measure
r2_df_1kb$bin_bdr_kb <- r2_df_1kb$bin_bdr / 1000
r2_df_1kb %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
labs(colour = "Chromosome")
ggsave(filename = paste("20200715_mean-r2_1kb-lim", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
r2_df_1kb %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
guides(colour = F)
ggsave(filename = paste("20200715_mean-r2_1kb-lim_no-guide", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
20200723
Do separate ones for each chr
# create hue palette
chr_pal <- scales::hue_pal()(24)
lapply(seq(1:24), function(x){
# plot
plot <- r2_df_1kb %>%
dplyr::filter(chr == x) %>%
ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
scale_colour_manual(values = chr_pal[x]) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
guides(colour = F) +
theme(panel.grid = element_blank())
# save
ggsave(filename = paste(x, ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/20200723_decay-by-chr_1kb",
width = 15,
height = 8.25,
units = "cm",
dpi = 500)
})
Insets are 3 x 1.65
end 20200723
sum(unlist(lapply(ld_plot_lst, function(x) nrow(x[["bed_mat_filt"]]@snps))))
20200715 Birney Group meeting
Get human LD plot – contrast. Major one we want to make. And it’s not quite as good as Drosophila and accept that.
Tracking down these LD blocks. Investigate individually. Find explanation for one. Describe the reason why.
For these variants in the blocks, the samples will split be ancestry. As we raise the MAF, we hide the unbalanced scenarios. Take SNPs in blocks. Make a matrix. SNPs by samples, then cluster. Try Hclust. Hopefully we have some on one side of the tree, some on the other. Tree won’t look balanced, but hopefully we have a sample on each side.
The other thing they could be is introgression from Northern medaka line. Came in and refuses to recombine with Southern medaka.
Could be regions of persistent heterozygosity? Sample regions of persistent heterozygosity and make bandage plots for them.
Let Felix know that it’s working out.
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing
# make BED
plink \
--vcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--make-bed \
--double-id \
--snps-only \
--biallelic-only \
--chr-set 24 no-xy \
--chr 1-24 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200716
# recode for 012 transposed
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--recode A-transpose \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200716_recode012
# compress
gzip -k plink/20200716_panel_no-sibs_line-ids_no-missing/20200716_recode012.traw
gzip -k plink/20200716_panel_no-sibs_line-ids_no-missing/20200716.bim
# send to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200716_panel_no-sibs_line-ids_no-missing/20200716_recode012.traw.gz ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200716_panel_no-sibs_line-ids_no-missing/20200716.bim.gz ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200716_panel_no-sibs_line-ids_no-missing/20200716.bed ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200716_panel_no-sibs_line-ids_no-missing/20200716.fam ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set
# decompress
gunzip ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716_recode012.traw.gz
gunzip ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716.bim.gz
library(gaston)
library(tidyverse)
── Attaching packages ──────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.2 ✓ purrr 0.3.4
✓ tibble 3.0.4 ✓ dplyr 1.0.2
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.0
── Conflicts ─────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
# try with read.bed.matrix
mikk_full <- gaston::read.bed.matrix("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716",
rds = NULL)
Error in gaston::read.bed.matrix("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716", :
file /Users/brettell/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716.bed not found
seq 1 24 > tmp1.txt
grep ">" refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa | scut -f6 -d":" | head -24 > tmp2.txt
paste tmp1.txt tmp2.txt > mikk_genome/data/Oryzias_latipes.ASM223467v1.dna.toplevel.fa_chr_counts.txt
chr_counts <- readr::read_tsv("~/Documents/Repositories/mikk_genome/data/Oryzias_latipes.ASM223467v1.dna.toplevel.fa_chr_counts.txt",
col_names = c("chr", "length"))
# get vector of seeds
set.seed(5)
seeds <- sample(seq(1, 100), 24)
# run over list
counter <- 0
test <- lapply(mikk_maf_filts$`0.03`, function(chr){
counter <<- counter + 1
# get seed
set.seed(seeds[counter])
# get target indices for subsample
targ_ind <- sort(sample(nrow(chr@snps), 1000))
# convert to logical vector
log_vec <- seq(1, nrow(chr@snps)) %in% targ_ind
# thin BED for sample SNPs
thin_bed <- gaston::LD.thin(chr, threshold = 1, max.dist = 50000, which.snps = log_vec)
# compute LD
ld.x <- gaston::LD(thin_bed, c(1,ncol(thin_bed)))
# replace NaNs with 0
ld.x[which(is.na(ld.x))] <- 0
# plot
LD.plot(ld.x,
snp.positions = thin_bed@snps$pos,
max.dist = 1000000,
write.ld = NULL,
write.snp.id = F,
pdf.file = paste("~/Documents/Docs/medaka pics/20200602_mikk_genome/20200717_ld_maf-0.03/",
"20200717_chr",
counter,
".pdf",
sep = ""))
})
K <- GRM(mikk_full)
heatmap(K)
# get vector of seeds
set.seed(10)
seeds <- sample(seq(1, 100), 24)
# run over list
counter <- 0
test <- lapply(mikk_maf_filts$`0.05`, function(chr){
counter <<- counter + 1
# get seed
set.seed(seeds[counter])
# get target indices for subsample
targ_ind <- sort(sample(nrow(chr@snps), 1000))
# convert to logical vector
log_vec <- seq(1, nrow(chr@snps)) %in% targ_ind
# thin BED for sample SNPs
thin_bed <- gaston::LD.thin(chr, threshold = 1, max.dist = 50000, which.snps = log_vec)
# compute LD
ld.x <- gaston::LD(thin_bed, c(1,ncol(thin_bed)))
# replace NaNs with 0
ld.x[which(is.na(ld.x))] <- 0
# plot
LD.plot(ld.x,
snp.positions = thin_bed@snps$pos,
max.dist = 1000000,
write.ld = NULL,
write.snp.id = F,
pdf.file = paste("~/Documents/Docs/medaka pics/20200602_mikk_genome/20200717_ld_maf-0.05/",
"20200717_chr",
counter,
".pdf",
sep = ""))
})
Calculate LD blocks with Plink
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200717_blocks
for i in $(seq 1 24); do \
bsub -M 30000 -o log/20200717_blocks_$i.out -e log/20200717_blocks_$i.err \
"plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--chr $i \
--blocks no-pheno-req no-small-max-span \
--chr-set 24 no-xy \
--blocks-max-kb 1000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200717_blocks/$i"; \
done
# example for chr 5
java -Xmx20G -jar ~/Documents/Software/Haploview.jar \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-5.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-5.info \
-maxDistance 1000
# did for chrs 5, 6, 16, 17
java -Xmx20G -jar ~/Documents/Software/Haploview.jar \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-16.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-16.info \
-maxDistance 1000
java -Xmx20G -jar ~/Documents/Software/Haploview.jar \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-17.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-17.info \
-maxDistance 1000
Works OK for chrs 5 and 17… • 5: Block 24 (363 Kb). IDs 3526 to 3631 (28418434:28778410) • 17: Blocks 18-36 (4 Mb). IDs 2549 to 3521 (15571645:19618234)
But not for 6 and 16. Try creating a less-thinned PED using the full BED that is sitting on the local
plink \
--bfile ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716 \
--recode HV \
--maf 0.05 \
--double-id \
--chr-set 24 no-xy \
--chr 6 \
--thin 0.05 \
--allele1234 \
--out ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.05/20200721
for i in $(find ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.05/20200721*.ped); do sed -i 's/\*/0/g' $i; done
sed -i '.bak' 's/\*/0/g' ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.05/20200721.chr-6.ped
java -Xmx20G -jar ~/Documents/Software/Haploview.jar \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.05/20200721.chr-6.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.05/20200721.chr-6.info \
-maxDistance 1000
# Thin 0.075
mkdir ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/
plink \
--bfile ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716 \
--allele1234 \
--recode HV \
--maf 0.05 \
--double-id \
--chr-set 24 no-xy \
--chr 6,16 \
--thin 0.075 \
--out ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/20200721
for i in $(find ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/20200721*.ped); do sed -i '.bak' 's/\*/0/g' $i; done
java -Xms6G -Xmx6G -jar ~/Documents/Software/Haploview.jar \
-memory 5000 \
-ldvalues RSQ \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/20200721.chr-6.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/20200721.chr-6.info \
-maxDistance 1000 \
-compressedpng \
-nogui
#Exception in thread "main" java.lang.NoSuchMethodError: java.lang.System.runFinalizersOnExit(Z)V
• 6: From 0.05: Block 112-129 (1.9 Mb). IDs 7693 to 8249 (30372030:32229289) From 0.075: Block 202-224. IDs 11671 to 12491 (30417868:32245570) • 16: From 0.075: N/A. Can’t identify. Try with LD colour scheme
java -Xms6G -Xmx6G -jar ~/Documents/Software/Haploview.jar \
-memory 5000 \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/20200721.chr-16.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/20200721.chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ
# No difference
# recode for HV
plink \
--bfile ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716 \
--recode HV \
--double-id \
--chr-set 24 no-xy \
--allele1234 \
--out ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv
## And do it on the cluster too
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--recode HV \
--double-id \
--chr-set 24 no-xy \
--allele1234 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview/20200722_haploview
# remove asterisks on cluster
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview/20200722_haploview*.ped); do sed -i 's/\*/0/g' $i; done
# remove asterisks on local
for i in $(find ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv*.ped); do sed -i '.bak' 's/\*/0/g' $i; done
rm ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv*.bak
# run on chr 16 directly
java -Xms10G -Xmx10G -jar ~/Documents/Software/Haploview.jar \
-memory 8000 \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv.chr-16.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv.chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-spacing 0.925
# Try with different parameters
java -Xms15G -Xmx15G -jar ~/Documents/Software/Haploview.jar \
-memory 14000 \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv.chr-16.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv.chr-16.info \
-maxDistance 50 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-spacing 0.99
# try running on cluster
java -Xms20G -Xmx20G -jar /nfs/software/birney/Haploview.jar \
-memory 18000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview/20200722_haploview.chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview/20200722_haploview.chr-16.info \
-maxDistance 100 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-spacing 0.95 \
-nogui \
-compressedpng
# Writing output to 20200722_haploview.chr-16.ped.LD.PNG
# Fatal Error:
# Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
java -Xms45G -Xmx45G -jar /nfs/software/birney/Haploview.jar \
-memory 40000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview/20200722_haploview.chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview/20200722_haploview.chr-16.info \
-maxDistance 100 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-spacing 0.95 \
-nogui \
-compressedpng
# Try thinning to different degrees with plink
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1500
# 1000
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--thin-count 1000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-$i;
done
## remove asterisks
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-*.ped); do sed -i 's/\*/0/g' $i; done
# Plot
java -Xms38G -Xmx38G -jar /nfs/software/birney/Haploview.jar \
-memory 38000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_1000/16
# And use this site to convert to PDF! https://www.zamzar.com/ (actually this one is free and works as well: tps://onlineconvertfree.com/)
# Try with 3000
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--thin-count 3000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-$i;
done
## remove asterisks
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-*.ped); do sed -i 's/\*/0/g' $i; done
# plot
java -Xms38G -Xmx38G -jar /nfs/software/birney/Haploview.jar \
-memory 38000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_3000/16
# try with 10000 and filter for MAF of 0.05 when creating plink HVs
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--maf 0.05 \
--thin-count 10000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/10000/20200723_chr-$i;
done
# It filters for MAF AFTER thinning. Create new BED set filtered by MAF
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_maf-0.05
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--make-bed \
--double-id \
--chr-set 24 no-xy \
--maf 0.05 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_maf-0.05/20200723
# Leaves 4.36M SNPs
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_maf-0.05/20200723 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--thin-count 10000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/10000/20200723_chr-$i;
done
## remove asterisks
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/10000/20200723_chr-*.ped); do sed -i 's/\*/0/g' $i; done
## Plot
java -Xms38G -Xmx38G -jar /nfs/software/birney/Haploview.jar \
-memory 38000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/10000/20200723_chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/10000/20200723_chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_10000/16
# Takes a while. Try with 1000
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_maf-0.05/20200723 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--thin-count 1000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-$i;
done
## remove asterisks
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-*.ped); do sed -i 's/\*/0/g' $i; done
# Plot
java -Xms38G -Xmx38G -jar /nfs/software/birney/Haploview.jar \
-memory 38000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_1000/16
# Works well. Let's try with 3000
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_maf-0.05/20200723 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--thin-count 3000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-$i;
done
## remove asterisks
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-*.ped); do sed -i 's/\*/0/g' $i; done
# Plot
java -Xms38G -Xmx38G -jar /nfs/software/birney/Haploview.jar \
-memory 38000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_3000/16
# Looks good. Probably about the limit.
# Try adding BP position to the plots
awk -v OFS="\t" {'print $2,$2'} plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.info > plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.info.test
# Plot
java -Xms38G -Xmx38G -jar /nfs/software/birney/Haploview.jar \
-memory 38000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.info.test \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_1000/16_withID
# WORKS. Do for all in 3000
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr*.info); do
outname=$(echo $i\_with-id);
awk -v OFS="\t" {'print $2,$2'} $i > $outname;
done
# Plot all
for i in $(seq 1 24); do
bsub -M 20000 -o log/20200723_hv_3000_$i.out -e log/20200723_hv_3000_$i.err \
"java -Xms18G -Xmx18G -jar /nfs/software/birney/Haploview.jar \
-memory 18000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-$i.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-$i.info_with-id \
-maxDistance 1000 \
-ldcolorscheme DEFAULT \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_3000/$i";
done
# Do for 1000 as well
for i in $(seq 1 24); do
bsub -M 20000 -o log/20200723_hv_1000_$i.out -e log/20200723_hv_1000_$i.err \
"java -Xms18G -Xmx18G -jar /nfs/software/birney/Haploview.jar \
-memory 18000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-$i.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-$i.info_with-id \
-maxDistance 1000 \
-ldcolorscheme DEFAULT \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_1000/$i";
done
# copy to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plots/20200723_ld_thinned_1000/* /Users/brettell/Documents/Docs/medaka\ pics/20200602_mikk_genome/20200723_haploview/20200723_1000/
svgs
# Use this online tool to convert: https://onlineconvertfree.com/convert-format/svg-to-pdf/
Final PDFs here: ~/Documents/Docs/medaka\ pics/20200602_mikk_genome/20200723_haploview
Based on these plots, the LD blocks to investigate are here:
• 5:28181970-28970558 (788 Kb) • 6:29398579-32246747 (2.85 Mb) • 12:25336174-25384053 (48 Kb) • 14:12490842-12947083 (456 Kb) • 17:15557892-19561518 (4 Mb) • 21:6710074-7880374 (1.17 Mb)
[20200803 Make plots for MAF > 0.10]
# Make new BED for MAF > 0.03
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_maf-0.03/
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--make-bed \
--double-id \
--chr-set 24 no-xy \
--maf 0.03 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_maf-0.03/20200803
# Leaves 5,879,255 SNPs
# Make new BED for MAF > 0.10
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_maf-0.10/
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--make-bed \
--double-id \
--chr-set 24 no-xy \
--maf 0.1 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_maf-0.10/20200803
# Leaves 2,968,786 SNPs
# Recode
## 0.03
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_hv_thinned/
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_hv_thinned/0.03
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_hv_thinned/0.10
for h in $(echo 0.03 0.10); do
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_maf-$h/20200803 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--thin-count 3000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_hv_thinned/$h/20200803_chr-$i;
done;
done
# Edit .ped files to remove asterisks
for h in $(echo 0.03 0.10); do
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_hv_thinned/$h/20200803_chr-*.ped); do
sed -i 's/\*/0/g' $i;
done;
done
# Edit .info files to make the SNP's bp position its ID
for h in $(echo 0.03 0.10); do
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_hv_thinned/$h/20200803_chr*.info); do
outname=$(echo $i\_with-id);
awk -v OFS="\t" {'print $2,$2'} $i > $outname;
done;
done
# Plot
mkdir plots/20200803_ld_thinned/
mkdir plots/20200803_ld_thinned/0.03
mkdir plots/20200803_ld_thinned/0.10
for h in $(echo 0.03 0.10); do
for i in $(seq 1 24); do
bsub -M 20000 -o log/20200803_hv_$h\_$i.out -e log/20200803_hv_$h\_$i.err \
"java -Xms18G -Xmx18G -jar /nfs/software/birney/Haploview.jar \
-memory 18000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_hv_thinned/$h/20200803_chr-$i.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200803_hv_thinned/$h/20200803_chr-$i.info_with-id \
-maxDistance 1000 \
-ldcolorscheme DEFAULT \
-ldvalues RSQ \
-minMAF $h \
-nogui \
-svg \
-out plots/20200803_ld_thinned/$h/$i";
done;
done
# copy to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plots/20200803_ld_thinned/ ~/Documents/Docs/medaka\ pics/20200602_mikk_genome/20200723_haploview/
# convert to pdf
# https://www.zamzar.com/ for files > 30 MB (chr 1) - note limit on number of files you can convert
# https://onlineconvertfree.com/convert-format/svg-to-pdf/ for the rest
# push to Gdrive
rclone copy ~/Documents/Docs/medaka\ pics/20200602_mikk_genome/20200723_haploview/20200803_pdfs google_drive_indigene:MIKK\ panel\ genome\ paper/Results/MAF\ \&\ LD/LD\ blocks\ \(Haploview\)/3000\ SNPs
[end 20200803]
trio.BiocManager::install("trio")
library(trio)
gaston to create the matrixlibrary(gaston)
# try with read.bed.matrix
mikk_full <- gaston::read.bed.matrix("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716",
rds = NULL)
# split by chromosome
mikk_full_split <- lapply(seq(1, 24), function(x){
gaston::select.snps(mikk_full, chr == x)
})
names(mikk_full_split) <- seq(1, 24)
# make list with filtered bed.matrixes based on MAF
## create vector of target MAFs
target_mafs <- c(0.03, 0.05)
## filter
mikk_maf_filts <- lapply(target_mafs, function(x){
lapply(mikk_full_split, function(y){
gaston::select.snps(y, maf > x)
})
})
names(mikk_maf_filts) <- target_mafs
trio requires genotype matrix so import
# make PED from BED
plink \
--bfile ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716 \
--recode \
--out ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716
# PED is 4.13GB. Delete.
test <- trio::read.pedfile("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716.ped",
p2g = T,
non.rs.IDs = T)
## Takes too long.
Try using snpStats
library(snpStats)
mikk_snpstats <- snpStats::read.plink("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716")
#non-unique value when setting 'row.names': ‘.’Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed
Back to manual
# read in data
mikk_geno <- readr::read_tsv(file = "~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716_recode012.traw",
progress = T,
col_names = T)
# split
mikk_geno_split <- split(mikk_geno, f = mikk_geno$CHR)
# remove redundant columns, set up col and row names, and convert to matrix
counter <- 0
mikk_geno_split <- lapply(mikk_geno_split, function(x){
counter <<- counter + 1
x <- x[, -(1:6)]
colnames(x) <- mikk_full@ped$id
x <- as.matrix(x)
rownames(x) <- paste(mikk_full_split[[counter]]@snps$chr,
mikk_full_split[[counter]]@snps$pos,
sep = ":")
return(x)
})
# get SNPs with MAF > 0.05
target_snps <- which(mikk_full_split$`6`@snps$maf > 0.05)
# pull out those SNPs
mikk_6 <- mikk_geno_split[["6"]][target_snps, ]
#test <- trio::findLDblocks(mikk_6,
# snp.in.col = F)
# Error in trio::findLDblocks(mikk_geno_split[["6"]], snp.in.col = F) : If x is a matrix, x must contain at most 500 SNPs. If x contains more SNPs, please consider to use getLDlarge.
test <- trio::getLDlarge(mikk_6,
neighbors = 25,
snp.in.col = F,
addVarN = T)
test2 <- trio::findLDblocks(test)
# takes hours to run. Try with fewer neighbours
test <- trio::getLDlarge(mikk_6,
neighbors = 10,
snp.in.col = F,
addVarN = T)
test2 <- trio::findLDblocks(test)
LDExplorerinstall.packages("LDExplorer", repos="http://R-Forge.R-project.org")
library(LDExplorer)
# get vector of seeds
set.seed(10)
seeds <- sample(seq(1, 100), 24)
# run over list
counter <- 0
test <- lapply(mikk_maf_filts$`0.05`[6], function(chr){
counter <<- counter + 1
# get seed
set.seed(seeds[counter])
# set boundaries
bounds <- c(25000000, max(chr@snps$pos))
# get sample of 1000
targ_ind <- sort(sample(which(dplyr::between(chr@snps$pos, bounds[1], bounds[2])), 1000))
# convert into logical vector
log_vec <- seq(1, nrow(chr@snps)) %in% targ_ind
# thin BED for sample SNPs
thin_bed <- gaston::LD.thin(chr,
threshold = 1,
max.dist = 50000,
which.snps = log_vec,
dist.unit = "bases")
# compute LD
ld.x <- gaston::LD(thin_bed, c(1,ncol(thin_bed)))
# replace NaNs with 0
ld.x[which(is.na(ld.x))] <- 0
# plot
LD.plot(ld.x,
snp.positions = thin_bed@snps$pos,
max.dist = 1000000,
write.ld = NULL,
write.snp.id = F,
pdf.file = paste("~/Documents/Docs/medaka pics/20200602_mikk_genome/20200717_ld_maf-0.05/zoomed/",
"20200717_chr6_25Mb-end",
".pdf",
sep = "")
)
})
Not easy to identify. Import Plink blocks
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200716_panel_no-sibs_line-ids_no-missing/20200717_blocks/*det ~/Documents/Data/20200707_mikk_ld/20200717_plink_blocks
block_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200717_plink_blocks", full.names = T)
file_names <- gsub(".blocks.det", "", basename(block_files))
block_lst <- lapply(block_files, function(x){
readr::read_delim(x,
delim = " ",
col_names = T,
trim_ws = T)
})
names(block_lst) <- file_names
# reorder
block_lst <- block_lst[order(as.integer(names(block_lst)))]
chr6 <- block_lst$`6`
chr6 <- chr6[order(chr6$KB, decreasing = T), ]
# make sequence of 50 kb intervals... too large. Too many hits within each interval. Try 5kb
ints <- seq(1, chr_counts$length[chr_counts$chr == "6"], by = 1000)
#ggplot2::cut_width(c(1, chr6$BP1, chr_counts$length[chr_counts$chr == "6"]), width = 50000)
# create vector of intervals
chr6_bp1 <- ggplot2::cut_interval(c(1, chr6$BP1, chr_counts$length[chr_counts$chr == "6"]),
length = 1000,
labels = F)
# remove first and last entries
chr6_bp1 <- chr6_bp1[2:(length(chr6_bp1) - 1)]
# test for difference in number between ints and unique values of chr6_bp1
length(unique(chr6_bp1))
# use ifelse to create vector of values of either 0 or 100
chr6_yesno <- ifelse(seq(1, length(ints)) %in% chr6_bp1, 100, 0)
# bind both into data frame
chr6_df <- data.frame(ints,
"yesno" = chr6_yesno)
# plot
ggplot(data = chr6_df) +
geom_col(aes(ints, yesno))
# find lengths of consecutive runs
chr6_df$consec <- sequence(rle(chr6_df$yesno)$lengths)
# find highest
max(sequence(rle(chr6_df$yesno)$lengths))
# find which one
which(sequence(rle(chr6_df$yesno)$lengths) == 38)
# that gathers consecutive runs of 0 as well. We want the consecutive runs of 100
chr6_df$consec <- sequence(rle(chr6_df$yesno == 100)$lengths)
# find maximum of 100s
max(chr6_df$consec[chr6_df$yesno == 100])
hist(chr6_df$consec[chr6_df$yesno == 100])
table(chr6_df$consec[chr6_df$yesno == 100])
#
The appearance of clusters changes depending on how zoomed-in you are to the plot.
20200721
Try running Plink --blocks agains with stricter thresholds
Note also that I raised the max blocks KB range to 10MB
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200721_blocks_lowci_0.85
for i in $(seq 1 24); do \
bsub -M 30000 -o log/20200721_blocks_$i.out -e log/20200721_blocks_$i.err \
"plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--chr $i \
--blocks no-pheno-req no-small-max-span \
--blocks-strong-lowci 0.8 \
--chr-set 24 no-xy \
--blocks-max-kb 10000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200721_blocks_lowci_0.85/$i"; \
done
Import Plink blocks
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200716_panel_no-sibs_line-ids_no-missing/20200721_blocks_lowci_0.85/*det ~/Documents/Data/20200707_mikk_ld/20200721_plink_blocks
block_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200721_plink_blocks", full.names = T)
file_names <- gsub(".blocks.det", "", basename(block_files))
block_lst <- lapply(block_files, function(x){
readr::read_delim(x,
delim = " ",
col_names = T,
trim_ws = T)
})
names(block_lst) <- file_names
# reorder
block_lst <- block_lst[order(as.integer(names(block_lst)))]
test <- block_lst[["6"]]
test <- test[order(test$KB, decreasing = T), ]
library(gaston)
# try with read.bed.matrix
mikk_full <- gaston::read.bed.matrix("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716",
rds = NULL)
# split by chromosome
mikk_full_split <- lapply(seq(1, 24), function(x){
gaston::select.snps(mikk_full, chr == x)
})
names(mikk_full_split) <- seq(1, 24)
# make list with filtered bed.matrixes based on MAF
## create vector of target MAFs
target_mafs <- c(0.03, 0.05)
## filter
mikk_maf_filts <- lapply(target_mafs, function(x){
lapply(mikk_full_split, function(y){
gaston::select.snps(y, maf > x)
})
})
names(mikk_maf_filts) <- target_mafs
data.frame("maf" = mikk_full@snps$maf) %>%
ggplot() +
geom_histogram(aes(x = maf,
y=..count../1000000),
bins = 40,
fill = "#2A9D8F",
colour = "#264653") +
theme_bw() +
guides(fill = F) +
xlab("Minor allele frequencies") +
ylab("Count (in millions of sites)")
ggsave(filename = paste("20200723_maf_freqs", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
cp /hps/research1/birney/users/ian/rac_hyp/vcfs/1gk_all.vcf.gz* /hps/research1/birney/users/ian/mikk_paper/vcfs
# needs --recode flag!!
vcftools \
--gzvcf vcfs/1gk_all.vcf.gz \
--max-missing 1 \
--recode \
--stdout > vcfs/1gk_all_no-missing.vcf
# Made a HUGE file - 298GB and only up to chr 6!
# VCF has 84,739,838 variants!
# Try different tack.
## compress
#bcftools view --output-type z --output-file vcfs/1gk_all_no-missing.vcf.gz vcfs/1gk_all_no-missing.vcf
## create index
#bcftools index --tbi vcfs/1gk_all_no-missing.vcf.gz
mkdir plink/20200723_1gk_no-missing_maf-0.05
# make BED
plink \
--vcf vcfs/1gk_all.vcf.gz \
--make-bed \
--double-id \
--snps-only \
--biallelic-only \
--maf 0.05 \
--geno 0 \
--out plink/20200723_1gk_no-missing_maf-0.05/20200723
# 7110223 SNPs remaining
# make another with no MAF limit
mkdir plink/20200723_1gk_no-missing
plink \
--vcf vcfs/1gk_all.vcf.gz \
--make-bed \
--double-id \
--snps-only \
--biallelic-only \
--geno 0 \
--out plink/20200723_1gk_no-missing/20200723
mkdir ld/20200724_1gk_maf-0.10_window-50kb_no-missing/
for i in $(seq 1 21); do
plink \
--bfile plink/20200723_1gk_no-missing_maf-0.05/20200723 \
--r2 \
--ld-window-kb 50 \
--chr $i \
--maf 0.10 \
--out ld/20200724_1gk_maf-0.10_window-50kb_no-missing/$i;
done
New script here: mikk_genome/code/scripts/20200724_r2_decay_mean_1gk.R
mkdir ld/20200724_mean_r2_1gk
# TEST
Rscript --vanilla mikk_genome/code/scripts/20200724_r2_decay_mean_1gk.R \
ld/20200724_1gk_maf-0.10_window-50kb_no-missing/22.ld \
ld/20200724_mean_r2_1gk
# WORKS
# TRUE
for i in $(find ld/20200724_1gk_maf-0.10_window-50kb_no-missing/*.ld); do
name=$(basename $i | cut -f1 -d".");
bsub -M 10000 -o log/20200724_$name\_mean-r2.out -e log/20200724_$name\_mean-r2.err "Rscript --vanilla mikk_genome/code/scripts/20200724_r2_decay_mean_1gk.R $i ld/20200724_mean_r2_1gk";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200724_mean_r2_1gk/ ~/Documents/Data/20200707_mikk_ld/
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200724_mean_r2_1gk/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200724_mean_r2_1gk/")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df <- dplyr::bind_rows(data_list, .id = "chr")
r2_df$chr <- factor(r2_df$chr, levels = seq(1, 22))
# get kb measure
r2_df$bin_bdr_kb <- r2_df$bin_bdr / 1000
r2_df %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
labs(colour = "Chromosome")
ggsave(filename = paste("20200724_mean-r2_1kg", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
r2_df %>% ggplot() +
geom_line(aes(bin_bdr_kb, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (Kb)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~chr, nrow = 6, ncol = 4) +
guides(colour = F) +
theme(axis.text = element_text(size = 8),
strip.text = element_text(size = 8),
panel.grid = element_blank())
ggsave(filename = paste("20200724_mean-r2_facet_50kb_1KG", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 20,
units = "cm",
dpi = 500)
mkdir ld/20200724_mean_r2_1kb-max_1gk
# TRUE
for i in $(find ld/20200724_1gk_maf-0.10_window-50kb_no-missing/*.ld); do
name=$(basename $i | cut -f1 -d".");
bsub -M 10000 -o log/20200724_$name\_mean-r2_1kb-max.out -e log/20200724_$name\_mean-r2_1kb-max.err "Rscript --vanilla mikk_genome/code/scripts/20200724_r2_decay_mean_1gk_1kb-lim.R $i ld/20200724_mean_r2_1kb-max_1gk";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200724_mean_r2_1kb-max_1gk ~/Documents/Data/20200707_mikk_ld/
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200724_mean_r2_1kb-max_1gk/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200724_mean_r2_1kb-max_1gk/")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb_1gk <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb_1gk$chr <- factor(r2_df_1kb_1gk$chr, levels = seq(1, 24))
# get kb measure
r2_df_1kb_1gk$bin_bdr_kb <- r2_df_1kb_1gk$bin_bdr / 1000
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2_1kb-max/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2_1kb-max/")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb_mikk <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb_mikk$chr <- factor(r2_df_1kb_mikk$chr, levels = seq(1, 24))
# get kb measure
r2_df_1kb_mikk$bin_bdr_kb <- r2_df_1kb_mikk$bin_bdr / 1000
r2_df_1kb_1gk %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
labs(colour = "Chromosome")
ggsave(filename = paste("20200724_mean-r2_1kb-lim_1KG", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
r2_df_1kb_1gk %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
guides(colour = F)
ggsave(filename = paste("20200724_mean-r2_1kb-lim_no-guide_1KG", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
r2_final_lst <- list("1KG" = r2_df_1kb_1gk,
"MIKK" = r2_df_1kb_mikk)
r2_final_df <- dplyr::bind_rows(r2_final_lst, .id = "dataset")
r2_final_df %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = dataset)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~chr, nrow = 6, ncol = 4) +
theme(axis.text = element_text(size = 8),
strip.text = element_text(size = 8),
panel.grid = element_blank()) +
scale_color_manual(values = c("#FC4E07", "#360568")) +
labs(colour = "")
ggsave(filename = paste("20200724_mean-r2_1kb-lim_1KGvMIKK", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
library(gaston)
Loading required package: Rcpp
Loading required package: RcppParallel
Attaching package: 'RcppParallel'
The following object is masked from 'package:Rcpp':
LdFlags
Gaston set number of threads to 2. Use setThreadOptions() to modify this.
Attaching package: 'gaston'
The following object is masked from 'package:stats':
sigma
The following objects are masked from 'package:base':
cbind, rbind
# try with read.bed.matrix
mikk_full <- gaston::read.bed.matrix("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716",
rds = NULL)
Error in gaston::read.bed.matrix("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716", :
file /Users/brettell/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716.bed not found
mikk_geno <- readr::read_tsv(file = "~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716_recode012.traw",
progress = T,
col_names = T)
# rename IDs
colnames(mikk_geno)[7:length(colnames(mikk_geno))] <- mikk_full@ped$id
Coordinates: 17:15559963-19553529
# find indexes for SNPs with MAF
target_inds <- which(mikk_full@snps$chr == 17 &
dplyr::between(mikk_full@snps$pos, 15559963, 19553529) &
mikk_full@snps$maf > 0.05)
chr17_ld <- mikk_geno[target_inds, ]
library(biclust)
# make matrix
chr_17_ld_mat <- as.matrix(chr17_ld[, -(1:6)])
# bicluster
biclust::biclust(chr_17_ld_mat, method = BCCC(), delta = 2, alpha = 1)
# 1 cluster found with 37312 Rows and 63 columns
biclust::biclust(t(chr_17_ld_mat), method = BCXmotifs())
# There was one cluster found with 63 Rows and 16631 columns
biclust_out <- biclust::biclust(chr_17_ld_mat, method = BCPlaid(),
cluster = "c",)
# Number of Clusters found: 10
# make heatmap
biclust::drawHeatmap(chr_17_ld_mat[1:1000, ], )
biclust::drawHeatmap2(chr_17_ld_mat[1:1000, ], )
biclust::drawHeatmap(chr_17_ld_mat, biclust_out)
biclust::biclust(chr_17_ld_mat, method = BCSpectral())
# no cluster found
# plot
biclust::drawHeatmap(chr_17_ld_mat,
bicResult = bicl_out_17)
heatmap(chr_17_ld_mat[1:1000, ],
Rowv = NA)
heatmap(chr_17_ld_mat, Rowv = NA)
# save
png(filename = paste("~/Documents/Docs/medaka pics/20200602_mikk_genome/",
"20200724_heatmap_chr17",
".png",
sep = ""),
width = 40,
height = 40,
units = "cm",
res = 500)
heatmap(chr_17_ld_mat, Rowv = NA)
dev.off()
Try with heatmaply
library(heatmaply)
heatmaply::ggheatmap(chr_17_ld_mat[1:1000, ],Rowv = NULL)
# get coordinates
high_ld_chrs <- c(5, 6, 12, 14, 17, 21)
high_ld_start <- c(28385805, 29608514, 25340000, 12584614, 15559963, 6800261)
high_ld_end <- c(28798048, 32212235, 25372985, 12861147, 19553529, 7760258)
# build into list
counter <- 0
high_ld_lst <- lapply(high_ld_chrs, function(x){
counter <<- counter + 1
x <- list("chr" = x,
"start" = high_ld_start[counter],
"end" = high_ld_end[counter])
# find indexes for SNPs with MAF > 0.05
x[["target_inds"]] <- which(mikk_full@snps$chr == x[["chr"]] &
dplyr::between(mikk_full@snps$pos, x[["start"]], x[["end"]]) &
mikk_full@snps$maf > 0.05)
x[["target_snps"]] <- mikk_geno[x[["target_inds"]], ]
# make matrix
x[["geno_mat"]] <- as.matrix(x[["target_snps"]][, -(1:6)])
return(x)
})
names(high_ld_lst) <- high_ld_chrs
counter <- 0
lapply(high_ld_lst, function(x){
counter <<- counter + 1
png(filename = paste("~/Documents/Docs/medaka pics/20200602_mikk_genome/",
"20200727_heatmap_chr",
names(high_ld_lst)[counter],
"low_res",
".png",
sep = ""),
width = 40,
height = 40,
units = "cm",
res = 300)
heatmap(x[["geno_mat"]], Rowv = NA)
dev.off()
})
# make BED
mkdir plink/20200727_mikk_no-missing_maf-0.05
plink \
--vcf vcfs/panel_no-sibs_line-ids.vcf.gz \
--make-bed \
--double-id \
--snps-only \
--biallelic-only \
--maf 0.05 \
--geno 0 \
--chr-set 24 no-xy \
--out plink/20200727_mikk_no-missing_maf-0.05/20200727
# get LD
mkdir ld/20200727_mikk_maf-0.10_window-50kb_no-missing/
for i in $(seq 1 24); do
plink \
--bfile plink/20200727_mikk_no-missing_maf-0.05/20200727 \
--r2 \
--ld-window 999999 \
--ld-window-kb 50 \
--ld-window-r2 0 \
--chr-set 24 no-xy \
--chr $i \
--maf 0.10 \
--out ld/20200727_mikk_maf-0.10_window-50kb_no-missing/$i;
done
# for 1KG too
mkdir ld/20200727_1kg_maf-0.10_window-50kb_no-missing/
for i in $(seq 1 22); do
plink \
--bfile plink/20200723_1gk_no-missing_maf-0.05/20200723 \
--r2 \
--ld-window 999999 \
--ld-window-kb 50 \
--ld-window-r2 0 \
--chr $i \
--maf 0.10 \
--out ld/20200727_1kg_maf-0.10_window-50kb_no-missing/$i;
done
# do again with ld-window-kb 10 to get counts of comparisons for paper
# get LD
mkdir ld/20200803_mikk_maf-0.10_window-10kb_no-missing/
for i in $(seq 1 24); do
plink \
--bfile plink/20200727_mikk_no-missing_maf-0.05/20200727 \
--r2 \
--ld-window 999999 \
--ld-window-kb 10 \
--ld-window-r2 0 \
--chr-set 24 no-xy \
--chr $i \
--maf 0.10 \
--out ld/20200803_mikk_maf-0.10_window-10kb_no-missing/$i;
done
# for 1KG too
mkdir ld/20200803_1kg_maf-0.10_window-10kb_no-missing/
for i in $(seq 1 22); do
plink \
--bfile plink/20200723_1gk_no-missing_maf-0.05/20200723 \
--r2 \
--ld-window 999999 \
--ld-window-kb 10 \
--ld-window-r2 0 \
--chr $i \
--maf 0.10 \
--out ld/20200803_1kg_maf-0.10_window-10kb_no-missing/$i;
done
# Get counts
wc -l ld/20200803_mikk_maf-0.10_window-10kb_no-missing/*.ld
# Total:
wc -l ld/20200803_1kg_maf-0.10_window-10kb_no-missing/*.ld
Rscript here: mikk_genome/code/scripts/20200727_r2_decay_mean_1gk_10kb-lim.R
mkdir ld/20200727_mean_r2_10kb-lim_mikk
# TRUE
for i in $(find ld/20200727_mikk_maf-0.10_window-50kb_no-missing/*.ld); do
name=$(basename $i | cut -f1 -d".");
bsub -M 10000 -o log/20200727_$name\_mean-r2_1kb-max.out -e log/20200727_$name\_mean-r2_1kb-max.err "Rscript --vanilla mikk_genome/code/scripts/20200727_r2_decay_mean_1gk_10kb-lim.R $i ld/20200727_mean_r2_10kb-lim_mikk";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200727_mean_r2_10kb-lim_mikk ~/Documents/Data/20200707_mikk_ld/
mkdir ld/20200727_mean_r2_10kb-lim_1kg
# TRUE
for i in $(find ld/20200727_1kg_maf-0.10_window-50kb_no-missing/*.ld); do
name=$(basename $i | cut -f1 -d".");
bsub -M 30000 -o log/20200727_$name\_mean-r2_10kb-max.out -e log/20200727_$name\_mean-r2_10kb-max.err "Rscript --vanilla mikk_genome/code/scripts/20200727_r2_decay_mean_1gk_10kb-lim.R $i ld/20200727_mean_r2_10kb-lim_1kg";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200727_mean_r2_10kb-lim_1kg ~/Documents/Data/20200707_mikk_ld/
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200727_mean_r2_10kb-lim_1kg/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200727_mean_r2_10kb-lim_1kg/")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb_1kg <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb_1kg$chr <- factor(r2_df_1kb_1kg$chr, levels = seq(1, 24))
# get kb measure
r2_df_1kb_1kg$bin_bdr_kb <- r2_df_1kb_1kg$bin_bdr / 1000
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200727_mean_r2_10kb-lim_mikk/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200727_mean_r2_10kb-lim_mikk")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb_mikk <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb_mikk$chr <- factor(r2_df_1kb_mikk$chr, levels = seq(1, 24))
# get kb measure
r2_df_1kb_mikk$bin_bdr_kb <- r2_df_1kb_mikk$bin_bdr / 1000
r2_df_1kb_1kg %>% ggplot() +
geom_line(aes(bin_bdr_kb, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (Kb)") +
ylab(bquote(.("Mean r")^2)) +
labs(colour = "Chromosome")
ggsave(filename = paste("20200727_mean-r2_10kb-lim_1KG", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
r2_df_1kb_1kg %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
guides(colour = F)
ggsave(filename = paste("20200727_mean-r2_1kb-lim_no-guide_1KG", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
r2_df_1kb_mikk %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
guides(colour = F)
r2_final_lst <- list("1KG" = r2_df_1kb_1kg,
"MIKK" = r2_df_1kb_mikk)
r2_final_df <- dplyr::bind_rows(r2_final_lst, .id = "dataset")
r2_final_df %>% ggplot() +
geom_line(aes(bin_bdr_kb, mean, colour = dataset)) +
theme_bw() +
xlab("Distance between SNPs (Kb)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~chr, nrow = 6, ncol = 4) +
theme(axis.text = element_text(size = 8),
strip.text = element_text(size = 8),
panel.grid = element_blank()) +
scale_color_manual(values = c("#FC4E07", "#360568")) +
labs(colour = "")
ggsave(filename = paste("20200727_mean-r2_10kb-lim_1KGvMIKK", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
r2_final_df %>% ggplot() +
geom_line(aes(bin_bdr_kb, mean, colour = chr)) +
theme_bw() +
xlab("Distance between SNPs (Kb)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~dataset, nrow = 1, ncol = 2) +
theme(panel.grid = element_blank()) +
labs(colour = "Chromosome")
ggsave(filename = paste("20200727_mean-r2_10kb-lim_1KGvMIKK_single", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 25,
height = 13,
units = "cm",
dpi = 500)
# 20200803 new
r2_final_df %>% ggplot() +
geom_line(aes(bin_bdr_kb, mean, colour = chr)) +
theme_bw() +
xlab("Distance between SNPs (kb)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~dataset, nrow = 1, ncol = 2) +
theme(panel.grid = element_blank()) +
labs(colour = "Chromosome") +
scale_y_continuous(breaks = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
limits = c(0.05, 0.6))
ggsave(filename = paste("20200803_mean-r2_10kb-lim_1KGvMIKK_single", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 25,
height = 13,
units = "cm",
dpi = 500)
20200803
Do with 10 kb bins so it’s the same as the inset.
Rscript here: mikk_genome/code/scripts/20200803_r2_decay_mean_1gk_10kb-lim.R
mkdir ld/20200803_mean_r2_10kb-lim_mikk
# TRUE
for i in $(find ld/20200727_mikk_maf-0.10_window-50kb_no-missing/*.ld); do
name=$(basename $i | cut -f1 -d".");
bsub -M 30000 -o log/20200727_$name\_mean-r2_1kb-max.out -e log/20200727_$name\_mean-r2_1kb-max.err "Rscript --vanilla mikk_genome/code/scripts/20200803_r2_decay_mean_1gk_10kb-lim.R $i ld/20200803_mean_r2_10kb-lim_mikk";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200803_mean_r2_10kb-lim_mikk ~/Documents/Data/20200707_mikk_ld/
mkdir ld/20200803_mean_r2_10kb-lim_1kg
# TRUE
for i in $(find ld/20200727_1kg_maf-0.10_window-50kb_no-missing/*.ld); do
name=$(basename $i | cut -f1 -d".");
bsub -M 30000 -o log/20200727_$name\_mean-r2_10kb-max.out -e log/20200727_$name\_mean-r2_10kb-max.err "Rscript --vanilla mikk_genome/code/scripts/20200803_r2_decay_mean_1gk_10kb-lim.R $i ld/20200803_mean_r2_10kb-lim_1kg";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200803_mean_r2_10kb-lim_1kg ~/Documents/Data/20200707_mikk_ld/
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200803_mean_r2_10kb-lim_1kg/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200803_mean_r2_10kb-lim_1kg/")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb_1kg <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb_1kg$chr <- factor(r2_df_1kb_1kg$chr, levels = seq(1, 24))
# get kb measure
r2_df_1kb_1kg$bin_bdr_kb <- r2_df_1kb_1kg$bin_bdr / 1000
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200803_mean_r2_10kb-lim_mikk/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200803_mean_r2_10kb-lim_mikk")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb_mikk <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb_mikk$chr <- factor(r2_df_1kb_mikk$chr, levels = seq(1, 24))
# get kb measure
r2_df_1kb_mikk$bin_bdr_kb <- r2_df_1kb_mikk$bin_bdr / 1000
r2_final_lst <- list("1KG" = r2_df_1kb_1kg,
"MIKK" = r2_df_1kb_mikk)
r2_final_df <- dplyr::bind_rows(r2_final_lst, .id = "dataset")
# 20200803 new
r2_final_df %>% ggplot() +
geom_line(aes(bin_bdr_kb, mean, colour = chr)) +
theme_bw() +
xlab("Distance between SNPs (kb)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~dataset, nrow = 1, ncol = 2) +
theme(panel.grid = element_blank()) +
labs(colour = "Chromosome") +
scale_y_continuous(breaks = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7),
limits = c(0.05, 0.7))
Too noisy. Don’t use.
mkdir ld/20200727_mean_r2_1kb-lim_mikk
# TRUE
for i in $(find ld/20200727_mikk_maf-0.10_window-50kb_no-missing/*.ld); do
name=$(basename $i | cut -f1 -d".");
bsub -M 10000 -o log/20200727_$name\_mean-r2_1kb-max.out -e log/20200727_$name\_mean-r2_1kb-max.err "Rscript --vanilla mikk_genome/code/scripts/20200724_r2_decay_mean_1gk_1kb-lim.R $i ld/20200727_mean_r2_1kb-lim_mikk";
done
# 20200803... Rerun for chr 2 and 18 - didn't finish
for i in $(find ld/20200727_mikk_maf-0.10_window-50kb_no-missing/{2.,18.}*ld); do
name=$(basename $i | cut -f1 -d".");
bsub -M 30000 -o log/20200803_$name\_mean-r2_1kb-max.out -e log/20200803_$name\_mean-r2_1kb-max.err "Rscript --vanilla mikk_genome/code/scripts/20200724_r2_decay_mean_1gk_1kb-lim.R $i ld/20200727_mean_r2_1kb-lim_mikk";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200727_mean_r2_1kb-lim_mikk ~/Documents/Data/20200707_mikk_ld/
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200727_mean_r2_1kb-lim_mikk",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200727_mean_r2_1kb-lim_mikk")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb_mikk <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb_mikk$chr <- factor(r2_df_1kb_mikk$chr, levels = seq(1, 24))
# get kb measure
#r2_df_1kb_mikk$bin_bdr_kb <- r2_df_1kb_mikk$bin_bdr / 1000
r2_df_1kb_mikk %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
labs(colour = "Chromosome") +
theme(panel.grid = element_blank(),
axis.text = element_text(size = 12),
axis.title = element_text(size = 16)) +
guides(colour = F) +
scale_y_continuous(breaks = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7),
limits = c(0.05, 0.7))
# New one
ggsave(filename = paste("20200803_mean-r2_1kb-lim_MIKK_inset", ".pdf", sep = ""),
device = "pdf",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 10.88,
height = 8,
units = "cm",
dpi = 500)
mikk_genome/code/scripts/20200803_r2_decay_mean_1gk_1kb-lim.R
mkdir ld/20200803_mean_r2_1kb-lim_mikk
for i in $(find ld/20200727_mikk_maf-0.10_window-50kb_no-missing/*ld); do
name=$(basename $i | cut -f1 -d".");
bsub -M 30000 -o log/20200803_$name\_mean-r2_1kb-max.out -e log/20200803_$name\_mean-r2_1kb-max.err "Rscript --vanilla mikk_genome/code/scripts/20200803_r2_decay_mean_1gk_1kb-lim.R $i ld/20200803_mean_r2_1kb-lim_mikk";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200803_mean_r2_1kb-lim_mikk ~/Documents/Data/20200707_mikk_ld/
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200803_mean_r2_1kb-lim_mikk",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200803_mean_r2_1kb-lim_mikk")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb_mikk <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb_mikk$chr <- factor(r2_df_1kb_mikk$chr, levels = seq(1, 24))
# get kb measure
#r2_df_1kb_mikk$bin_bdr_kb <- r2_df_1kb_mikk$bin_bdr / 1000
r2_df_1kb_mikk %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
labs(colour = "Chromosome") +
theme(panel.grid = element_blank(),
axis.text = element_text(size = 12),
axis.title = element_text(size = 14)) +
guides(colour = F) +
scale_x_continuous(limits = c(0, 1000)) +
scale_y_continuous(breaks = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
limits = c(0.05, 0.6))
# New one
ggsave(filename = paste("20200803_mean-r2_1kb-lim_MIKK_inset_100bp-bins", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 10.88,
height = 8,
units = "cm",
dpi = 500)
# 1KG
plink \
--bfile plink/20200723_1gk_no-missing/20200723 \
--freq \
--out maf/20200727_1kg_no-missing
# Creates a 3.9GB file.
# MIKK
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--freq \
--chr-set 24 no-xy \
--out maf/20200727_mikk_no-missing
# Send to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/maf/20200727_mikk_no-missing.frq.gz ~/Documents/Data/20200707_mikk_ld
gunzip ~/Documents/Data/20200707_mikk_ld/20200727_mikk_no-missing.frq.gz
maf_mikk <- readr::read_delim("~/Documents/Data/20200707_mikk_ld/20200727_mikk_no-missing.frq",
delim = " ",
trim_ws = T) %>%
dplyr::select(MAF)
maf_mikk$dataset <- "MIKK"
maf_1kg <- readr::read_delim("~/Documents/Data/20200707_mikk_ld/20200727_1kg_no-missing.frq",
delim = " ",
trim_ws = T,
col_types = cols_only(MAF = col_double()))
maf_1kg$dataset <- "1KG"
maf_final <- rbind(maf_mikk, maf_1kg)
# Plot
maf_final %>%
ggplot() +
geom_histogram(aes(x = MAF,
y=0.01*..density..,
fill = dataset),
binwidth = 0.01) +
theme_bw() +
guides(fill = F) +
facet_wrap(~dataset, nrow = 1, ncol = 2) +
xlab("Minor allele frequencies") +
ylab("Density") +
theme(panel.grid = element_blank()) +
scale_fill_manual(values = c(`1KG` = "#FC4E07",
MIKK = "#360568"))
ggsave(filename = paste("20200727_maf_1KGvMIKK", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 25,
height = 13,
units = "cm",
dpi = 400)
Use this script instead because it’s quicker on the cluster: mikk_genome/code/scripts/20200727_maf_plot.R
Rscript --vanilla mikk_genome/code/scripts/20200727_maf_plot.R maf/20200727_mikk_no-missing.frq maf/20200727_1kg_no-missing.frq plots 20200727_maf_MIKKv1KG
cd refs
wget ftp://ftp.ensembl.org/pub/release-100/fasta/gasterosteus_aculeatus/dna/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz
# create index using BWA
bwa index refs/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz
# Try MUMmer4: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005944
mkdir stickleback
mkdir stickleback/sams
nucmer \
--sam-long=stickleback/sams/20200721.sam \
refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
refs/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa
# convert SAM to BAM
mkdir stickleback/bams
samtools view \
-b \
-o stickleback/bams/20200721.bam \
stickleback/sams/20200721.sam
#[E::sam_parse1] missing SAM header
#[W::sam_read1] Parse error at line 3
#[main_samview] truncated file.
# sort
samtools sort \
-o stickleback/bams/20200728_stickle2hdrr_sorted.bam \
-O bam \
stickleback/sams/20200721.sam
# Try with minimap2
mkdir sams
minimap2 -ax asm5 \
refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
refs/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz \
> sams/20200728_stickle2hdrr.sam
# try with asm20
minimap2 -ax asm20 \
refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
refs/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz \
> sams/20200728_stickle2hdrr_asm20.sam
# Convert to BAM
mkdir bams
samtools view \
-b sams/20200728_stickle2hdrr.sam > bams/20200728_stickle2hdrr.bam
samtools view \
-b sams/20200728_stickle2hdrr_asm20.sam > bams/20200728_stickle2hdrr_asm20.bam
# Sort
samtools sort \
-o bams/20200728_stickle2hdrr_sorted.bam \
-O bam \
bams/20200728_stickle2hdrr.bam
samtools sort \
-o bams/20200728_stickle2hdrr_asm20_sorted.bam \
-O bam \
bams/20200728_stickle2hdrr_asm20.bam
# Index
samtools index \
bams/20200728_stickle2hdrr_sorted.bam
# Call variants
gatk --java-options "-Xmx4g" HaplotypeCaller \
-R refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
-I bams/20200728_stickle2hdrr_sorted.bam \
-O vcfs/20200728_stickle2hdrr.g.vcf.gz \
-ERC GVCF
#A USER ERROR has occurred: Argument emit-ref-confidence has a bad value: Can only be used in single sample mode currently. Use the --sample-name argument to run on a single sample out of a multi-sample BAM file.
## Try bcftools
bcftools mpileup \
--output vcfs/20200728_stickle2hdrr.vcf.gz \
--output-type z \
-f refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
bams/20200728_stickle2hdrr_sorted.bam
# Has 85,669 records, but only 1549 SNPs
bcftools mpileup \
--output vcfs/20200728_stickle2hdrr_asm20.vcf.gz \
--output-type z \
-f refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
bams/20200728_stickle2hdrr_asm20_sorted.bam
# This one has 2,686,537 records and 301,521 SNPs
# Pipe to bcftools call
bcftools mpileup \
-Ou \
-f refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
bams/20200728_stickle2hdrr_asm20_sorted.bam | \
bcftools call \
-mv \
--output-type z \
--output vcfs/20200728_stickle2hdrr_asm20_called.vcf.gz
# This one has 125,002 SNPs
Data here: ftp://ftp.ensembl.org/pub/release-100/emf/ensembl-compara/multiple_alignments/47_fish.epo/
mkdir mult_algns
WGS Sequence Set: BAAE01000000.1: https://www.ebi.ac.uk/ena/browser/view/BAAE01 ftp://ftp.ebi.ac.uk/pub/databases/ena/wgs/public/ba/BAAE01.fasta.gz From The medaka draft genome and insights into vertebrate genome evolution. Nature 447(7145): 714-719 (2007 Jun)
# From Ensembl
ftp://ftp.ensembl.org/pub/release-100/fasta/oryzias_latipes_hni/dna/
library(biomaRt)
listMarts()
# Connect to SNP database
ensembl <- useMart("ENSEMBL_MART_SNP")
# Doesn't seem to have medaka assemblies...
minigraphEnsembl pages here:
cd refs
# HNI
wget ftp://ftp.ensembl.org/pub/release-100/fasta/oryzias_latipes_hni/dna/Oryzias_latipes_hni.ASM223471v1.dna.toplevel.fa.gz
# HSOK
wget ftp://ftp.ensembl.org/pub/release-100/fasta/oryzias_latipes_hsok/dna/Oryzias_latipes_hsok.ASM223469v1.dna.toplevel.fa.gz
mkdir gfas
# Create reference graph with all three
minigraph -xggs -t16 \
refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz \
refs/Oryzias_latipes_hni.ASM223471v1.dna.toplevel.fa.gz \
refs/Oryzias_latipes_hsok.ASM223469v1.dna.toplevel.fa.gz \
> gfas/20200729_medaka.gfa
# Create reference graph with just HdrR
minigraph -xggs -t16 \
refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz \
> gfas/20200729_hdrr.r.gfa
# Create graph of HNI
mkdir gafs
minigraph -t16 \
gfas/20200729_hdrr.r.gfa \
refs/Oryzias_latipes_hni.ASM223471v1.dna.toplevel.fa.gz \
> gafs/20200730_hni2hdrr.gaf
# Send to local to look at with bandage
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/gafs/20200730_hni2hdrr.gaf ~/Documents/Data/20200707_mikk_ld
# Error: Please verify that this file has the correct format
# Send GFA with all 3 references to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/gfas/20200729_medaka.gfa ~/Documents/Data/20200707_mikk_ld
# Pull out SN column ('Name of stable sequence from which the segment is derived')
cut -f5 gfas/20200729_hdrr.r.gfa | sort | uniq
# Insert string into FASTA references to identify the chromosomes
/nfs/software/birney/bcftools-1.10.2/bcftools view \
-o vcfs/20200729_13_2_no-missing.vcf.gz \
--output-type z \
--samples 13_2 \
--genotype ^miss \
vcfs/full-run_line-ids.vcf.gz
# or a more efficient way with the plugin split
mkdir vcfs/indiv_split
bcftools +split \
--exclude 'GT~"\."' \
--output vcfs/indiv_split \
--output-type z \
--samples-file mikk_genome/data/20200730_vcf_split_samples_file.txt \
vcfs/full-run_line-ids.vcf.gz
# index
bcftools index \
--tbi \
vcfs/indiv_split/134_1.vcf.gz
# try with GATK
gatk FastaAlternateReferenceMaker \
-R refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
-O fastas_from_vcf/134_1_gatk.fa \
-V vcfs/indiv_split/134_1.vcf.gz
# try more directly with bcftools's consensus
mkdir fastas_from_vcf
bcftools consensus \
--fasta-ref refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz \
--output fastas_from_vcf/134_1.fa \
--sample 134_1 \
vcfs/full-run_line-ids.vcf.gz
bcftools consensus \
--fasta-ref refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz \
--output fastas_from_vcf/69_1.fa \
--sample 69_1 \
vcfs/full-run_line-ids.vcf.gz
# Insert sample name in chr ID
sed 's/>/>MIKK_134_1_chr/g' fastas_from_vcf/134_1.fa > fastas_from_vcf/134_1.id.fa
sed 's/>/>MIKK_69_1_chr/g' fastas_from_vcf/69_1.fa > fastas_from_vcf/69_1.id.fa
# Test GFA with just 134_1
minigraph -xggs -t16 \
gfas/20200729_hdrr.r.gfa \
fastas_from_vcf/134_1.id.fa \
> gfas/20200730_hdrr_134_1.r.gfa
# True GFA with 134_1 and 69_1
minigraph -xggs -t16 \
gfas/20200729_hdrr.r.gfa \
fastas_from_vcf/134_1.id.fa \
fastas_from_vcf/69_1.id.fa \
> gfas/20200730_hdrr-134_1-69_1.r.gfa
bandage# Print all offsets in HdrR chr 17
# 17:15557892-19561518
grep "SN:Z:17" gfas/20200730_hdrr_134_1.r.gfa | cut -f6 | sort | uniq
# Before start: SO:i:15403701
# After start: SO:i:16191302
# End: SO:i:19589961
# Get node numbers
## Before start
grep "SO:i:15403701" gfas/20200730_hdrr_134_1.r.gfa | cut -f2
#s8310
# After start
grep "SO:i:16191302" gfas/20200730_hdrr_134_1.r.gfa | cut -f2
#s8317
# End
grep "SO:i:19589961" gfas/20200730_hdrr_134_1.r.gfa | cut -f2
#s8390
# Middle node:
#s8350, with a distance of 40
# Looks unremarkable.
# First, add a shortcut to the shared folder onto My Drive
# Then use `rclone` to list the files in the directories you want
rclone ls google_drive_ebi:MIKK\ panel\ genome\ paper/Results/Graph\ Assembly/individual_assemblies/
# matching pattern
rclone lsf --absolute --include "*clean.fa" google_drive_ebi:MIKK\ panel\ genome\ paper/Results/Graph\ Assembly/individual_assemblies
# copy
rclone copy --include "*clean.fa" google_drive_ebi:MIKK\ panel\ genome\ paper/Results/Graph\ Assembly/individual_assemblies/ /hps/research1/birney/users/ian/mikk_paper/fastas_ont
minigraph -xggs -t16 \
gfas/20200729_hdrr.r.gfa \
fastas_ont/134-1_H4_clean.fa \
fastas_ont/69-1_F3_clean.fa \
> gfas/20200730_hdrr-134_1_ont-69_1_ont.r.gfa
# node ID
grep "SN:Z" gfas/20200730_hdrr-134_1_ont-69_1_ont.r.gfa | cut -f2 > mikk_genome/data/20200730_hdrr-134_1_ont-69_1_ont_labels.csv
# chr number
grep "SN:Z" gfas/20200730_hdrr-134_1_ont-69_1_ont.r.gfa | cut -f5 | cut -d":" -f3 > mikk_genome/data/20200730_hdrr-134_1_ont-69_1_ont_labels_chr.csv
# paste together
paste \
mikk_genome/data/20200730_hdrr-134_1_ont-69_1_ont_labels.csv \
mikk_genome/data/20200730_hdrr-134_1_ont-69_1_ont_labels_chr.csv \
> mikk_genome/data/20200730_hdrr-134_1_ont-69_1_ont_labels_final.csv
mikk_genome/code/scripts/20200731_add_colours_and_line_labels.Rdata <- read.delim("~/Documents/Repositories/mikk_genome/data/20200730_hdrr-134_1_ont-69_1_ont_labels_final.csv",
header = F)
# Add colnames
colnames(data) <- c("node", "chr")
# Add line column
data$line <- vector(mode = "character",
length = nrow(data))
# Get HdrR
data$line[data$chr %in% c(seq(1, 24), "MT")] <- "HdrR"
# Get MIKK
data$line[!(data$chr %in% c(seq(1, 24), "MT"))] <- sapply(data$chr[!(data$chr %in% c(seq(1, 24), "MT"))], function(x){
paste(stringr::str_split(x, pattern = "_", simplify = T)[1:2], collapse = "_")
})
# Get colours
l_lines <- read.delim("~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1_ont-69_1_ont.L.r.gfa",
header = F)
colnames(l_lines) <- c("record_type", "from", "from_orient", "to", "to_orient", "overlap", "rank", "seg1_length", "seg2_length")
# find nodes which are positive but lead to a negative
length(l_lines$from[l_lines$from_orient == "+" & l_lines$to_orient == "-"])
# 33836
# How many unique?
length(unique(l_lines$from[l_lines$from_orient == "+" & l_lines$to_orient == "-"]))
# 33834
# Find which ones are repeated
l_lines$from[l_lines$from_orient == "+" & l_lines$to_orient == "-"][which(duplicated(l_lines$from[l_lines$from_orient == "+" & l_lines$to_orient == "-"]))]
# Create vector of reversies
reversed_nodes <- unique(l_lines$from[l_lines$from_orient == "+" & l_lines$to_orient == "-"])
# assign direction to data
data$direction <- ifelse(data$node %in% reversed_nodes, "-", "+")
# assign colours
data$colour <- ifelse(data$line == "HdrR",
"#0BC166",
ifelse(data$line == "MIKK_134-1" & data$direction == "+",
"#FC4E07",
ifelse(data$line == "MIKK_134-1" & data$direction == "-",
"#E7B800",
ifelse(data$line == "MIKK_69-1" & data$direction == "+",
"#360568",
ifelse(data$line == "MIKK_69-1" & data$direction == "-",
"#D84797",
NA)))))
write.table(data, "~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1_ont-69_1_ont_labels.csv", quote = F, row.names = F, col.names = T, sep = ",")
# They're not.
grep "SN:Z:17" gfas/20200730_hdrr-134_1_ont-69_1_ont.r.gfa | cut -f6 | uniq
# Start
grep "SN:Z:17" gfas/20200730_hdrr-134_1_ont-69_1_ont.r.gfa | grep "SO:i:15570516" | cut -f2
#s99963
# End
grep "SN:Z:17" gfas/20200730_hdrr-134_1_ont-69_1_ont.r.gfa | grep "SO:i:19568172" | cut -f2
#s101030
# Target node:
101030 + ((101030-99963)/2)
# s101588
# Range
(101032-99920)/2
# 556
# Create file with offset coordinates for easy reading
grep "SN:Z:17" ~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1_ont-69_1_ont.r.gfa | awk -v OFS='\t' '{print $2, $4, $5, $6}' > ~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200731_hdrr-134_1_ont-69_1_ont_chr17.txt
chr17 <- read.delim("~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200731_hdrr-134_1_ont-69_1_ont_chr17.txt", header = F)
Notes:
Small reverse strands from 134_1 start around s99950 (17:15,538,207) Forward strand from s100297 (17:16,871,942) to s100665 (17:18,303,112) Small reverse strands start again from s100668 (18,330,419) to s100748 (18,503,101) Forward strands from s100748 (18,503,101) to s101016 (19,503,814) Small reverse strands from s101016 (19,503,814) to s101353 (21,058,438)
Start: s96970. Short forward strands. * s96970 (0): Forward 942K * s97122 (942,226): Reverse 677k
* s97295 (1,619,320): Forward
* s97386 (1,940,652): Reverse 6923k * s98830 (8,864,153): Forward * s98895 (9,277,441): Reverse * s98941 (9,508,811): Forward * s99379 (12,444,863): Reverse * s99766 (14,754,500): Forward * s99950 (15,538,207): Reverse * s100297 (16,871,942): Forward * s100668 (18,330,419): Reverse * s100748 (18,503,101): Forward * s101016 (19,503,814): Reverse * s101358 (21,114,902): Forward * s101459 (21,652,350): Reverse * s101821 (23,753,658): Forward * s101874 (24,106,633): Reverse * s102501 (28,375,703): Forward
Hmm, not sure if strand is significant. Counts between + and - look pretty even for MIKK lines:
dplyr::count(data, line, direction)
# node ID
grep "SN:Z" ~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1-69_1.r.gfa | cut -f2 > ~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/tmp1.txt
# chr number
grep "SN:Z" ~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1-69_1.r.gfa | cut -f5 | cut -d":" -f3 > ~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/tmp2.txt
# paste together
paste \
~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/tmp1.txt \
~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/tmp2.txt \
> ~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1-69_1.r.gfa.labs
rm ~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/tmp*
data_ill <- read.delim("~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1-69_1.r.gfa.labs",
header = F)
# Add colnames
colnames(data_ill) <- c("node", "chr")
# Add line
data_ill$line <- vector(mode = "character",
length = nrow(data_ill))
# Get HdrR
data_ill$line[data_ill$chr %in% c(seq(1, 24), "MT")] <- "HdrR"
# Get MIKK
data_ill$line[!(data_ill$chr %in% c(seq(1, 24), "MT"))] <- sapply(data_ill$chr[!(data_ill$chr %in% c(seq(1, 24), "MT"))], function(x){
paste(stringr::str_split(x, pattern = "_", simplify = T)[1:3], collapse = "_")
})
# Get colours
data_ill$colour <- ifelse(data_ill$line == "HdrR",
"#0BC166",
ifelse(data_ill$line == "MIKK_134_1",
"#FC4E07",
ifelse(data_ill$line == "MIKK_69_1",
"#360568",
NA)))
# Write table
write.table(data_ill, "~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1-69_1_labels.csv", quote = F, row.names = F, col.names = T, sep = ",")
grep "^L" ~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1-69_1.r.gfa > ~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1-69_1.L.r.gfa
l_lines_ill <- read.delim("~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1-69_1.L.r.gfa",
header = F)
colnames(l_lines_ill) <- c("record_type", "from", "from_orient", "to", "to_orient", "overlap", "rank", "seg1_length", "seg2_length")
# Create file with offset coordinates for easy reading
grep "SN:Z:17" ~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1-69_1.r.gfa | awk -v OFS='\t' '{print $2, $4, $5, $6}' > ~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1-69_1_chr17.txt
chr17_ill <- read.delim("~/Documents/Data/20200707_mikk_ld/20200730_graph_genomes/20200730_hdrr-134_1-69_1_chr17.txt", header = F)
From s11711 (15,483,925) to s11841 (20,082,646)
Here: https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome
# SETUP
mkcd abbababba_tutorial
mkdir data
cd data
wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz
wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_whole_genome/data/hel92.pop.txt
cd ..
wget https://github.com/simonhmartin/genomics_general/archive/master.zip
unzip master.zip
# Get allele freqs
## NOTE: needed to change `xrange` to `range` in python script to be compatible with Python 3.7
python genomics_general-master/freq.py -g data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz \
-p mel_mel -p mel_ros -p mel_vul -p mel_mal -p mel_ama \
-p cyd_chi -p cyd_zel -p tim_flo -p tim_txn -p num \
--popsFile data/hel92.pop.txt --target derived \
-o data/hel92.DP8MP4BIMAC2HET75dist250.derFreq.tsv.gz
Tutorial here: https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome
cd refs
wget ftp://ftp.ensembl.org/pub/release-100/fasta/gasterosteus_aculeatus/dna/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz
# From Ensembl
cd refs
# HNI
wget ftp://ftp.ensembl.org/pub/release-100/fasta/oryzias_latipes_hni/dna/Oryzias_latipes_hni.ASM223471v1.dna.toplevel.fa.gz
# HSOK
wget ftp://ftp.ensembl.org/pub/release-100/fasta/oryzias_latipes_hsok/dna/Oryzias_latipes_hsok.ASM223469v1.dna.toplevel.fa.gz
# Try with minimap2, asm20
mkdir sams
## Stickleback
minimap2 -ax asm20 \
refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
refs/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz \
> sams/20200728_stickle2hdrr_asm20.sam
## HNI and HSOK
for i in $(find refs/Oryzias_latipes_h*.fa.gz); do
name=$(echo $i | cut -f1 -d'.' | cut -f3 -d'_');
bsub -M 30000 -n 12 -o log/20200806_map_$name.out -e log/20200806_map_$name.err \
"minimap2 -ax asm20 \
refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
$i \
> sams/20200806_$name\2hdrr_asm20.sam";
done
### Try with asm5
for i in $(echo hni hsok); do
name=$(echo $i | cut -f1 -d'.' | cut -f3 -d'_');
bsub -M 20000 -n 12 -o log/20200806_map_$i\_asm5.out -e log/20200806_map_$i\_asm5.err \
"minimap2 -ax asm5 \
refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
refs/Oryzias_latipes_$i.ASM223471v1.dna.toplevel.fa.gz \
> sams/20200806_$i\2hdrr_asm5.sam";
done
# Convert to BAM
mkdir bams
samtools view \
-b sams/20200728_stickle2hdrr_asm20.sam > bams/20200728_stickle2hdrr_asm20.bam
## HNI and HSOK
for i in $(echo hni hsok); do
samtools view \
-b sams/20200806_$i\2hdrr_asm20.sam \
> bams/20200806_$i\2hdrr_asm20.bam;
done
# Sort
samtools sort \
-o bams/20200728_stickle2hdrr_asm20_sorted.bam \
-O bam \
bams/20200728_stickle2hdrr_asm20.bam
## HNI and HSOK
for i in $(echo hni hsok); do
samtools sort \
-o bams/20200806_$i\2hdrr_asm20_sorted.bam \
-O bam \
bams/20200806_$i\2hdrr_asm20.bam;
done
# Call variants with bcftools
bcftools mpileup \
--output vcfs/20200728_stickle2hdrr_asm20.vcf.gz \
--output-type z \
-f refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
bams/20200728_stickle2hdrr_asm20_sorted.bam
# This one has 2,686,537 records and 301,521 SNPs
# Pipe to bcftools call
bcftools mpileup \
-Ou \
-f refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
bams/20200728_stickle2hdrr_asm20_sorted.bam | \
bcftools call \
-mv \
--output-type z \
--output vcfs/20200728_stickle2hdrr_asm20_called.vcf.gz
# This one has 125,002 SNPs
## HNI and HSOK
for i in $(echo hni hsok); do
bsub -M 30000 -n 12 -o log/20200806_pileup-and-call_$name.out -e log/20200806_pileup-and-call_$name.err \
"bcftools mpileup \
-Ou \
-f refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa \
bams/20200806_$i\2hdrr_asm20_sorted.bam | \
bcftools call \
-mv \
--output-type z \
--output vcfs/20200806_$i\2hdrr_asm20_called.vcf.gz";
done
# Index VCFs
bcftools index --tbi vcfs/20200728_stickle2hdrr_asm20_called.vcf.gz
## HNI and HSOK
for i in $(echo hni hsok); do
bcftools index --tbi vcfs/20200806_$i\2hdrr_asm20_called.vcf.gz;
done
# Merge
bcftools merge \
--output vcfs/panel_no-sibs_line-ids_stickle.vcf.gz \
--output-type z \
vcfs/20200728_stickle2hdrr_asm20_called.vcf.gz \
vcfs/panel_no-sibs_line-ids.vcf.gz
# WORKS
# Merge all, but using no-missing VCF
bcftools merge \
--output vcfs/panel_no-sibs_line-ids_no-missing_stickle-hni-hsok.vcf.gz \
--output-type z \
vcfs/20200728_stickle2hdrr_asm20_called.vcf.gz \
vcfs/20200806_hni2hdrr_asm20_called.vcf.gz \
vcfs/20200806_hsok2hdrr_asm20_called.vcf.gz \
vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz
# Rename samples
bcftools reheader \
--output vcfs/panel_no-sibs_line-ids_no-missing_stickle-hni-hsok_reheader.vcf \
--samples mikk_genome/data/20200806_introgression_old-new-sample_names.txt \
vcfs/panel_no-sibs_line-ids_no-missing_stickle-hni-hsok.vcf.gz
# Remove missing and multiallelic sites, and compress
bcftools view \
--genotype ^miss \
--min-alleles 2 \
--max-alleles 2 \
--output-file vcfs/panel_no-sibs_line-ids_no-missing_stickle-hni-hsok_reheader_no-missing.vcf.gz \
--output-type z \
vcfs/panel_no-sibs_line-ids_no-missing_stickle-hni-hsok_reheader.vcf
# Only has 963 SNPs
# How many per chromosome?
gunzip -c vcfs/panel_no-sibs_line-ids_no-missing_stickle-hni-hsok_reheader_no-missing.vcf.gz | \
grep -v "#" | \
cut -f1 | \
uniq -c
# 1 3
# 2 16
# 3 124
# 4 28
# 5 23
# 6 62
# 7 70
# 8 4
# 9 77
# 10 32
# 12 1
# 13 5
# 14 11
# 15 155
# 16 43
# 17 1
# 18 2
# 19 3
# 20 99
# 21 143
# 22 50
# 23 11
Here: https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome
mkcd abbababa_tutorial
mkcd data
wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz
wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_whole_genome/data/hel92.pop.txt
cd ..
# Download python scripts
wget https://github.com/simonhmartin/genomics_general/archive/master.zip
unzip master.zip
# Get genome-wide allele frequencies
python genomics_general-master/freq.py -g data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz \
-p mel_mel -p mel_ros -p mel_vul -p mel_mal -p mel_ama \
-p cyd_chi -p cyd_zel -p tim_flo -p tim_txn -p num \
--popsFile data/hel92.pop.txt --target derived \
-o data/hel92.DP8MP4BIMAC2HET75dist250.derFreq.tsv.gz
# data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz has 884,628 variants
bcftools query -l vcfs/panel_no-sibs_line-ids_no-missing_stickle-hni-hsok_reheader_no-missing.vcf.gz > mikk_genome/data/20200806_popfile.txt
# on cluster
data <- data.frame("line" = read.delim("mikk_genome/data/20200806_popfile.txt",
header = F),
stringsAsFactors = F)
data$pop <- c("stickleback", "hni", "hsok", rep("mikk", 63))
write.table(data,
"mikk_genome/data/20200806_popfile.txt",
quote = F,
row.names = F,
col.names = F,
sep = "\t")
Find 47 fish EPO. Will be a parsing nightmare.
Need to parse the tree. Load into tree viewer. Maybe use oryzias indo. Want the closest ancestor to the split between hdr, hni and hsok.. Pretty sure that’ll be javanicus. Count the lines down - 10th sequence down. Then down on the sequence you count 10th column. First is latipes. For every SNP in latipes, one will be the refernce, and one will be the alternate. Either one or the other will be the ancestor. A diagnostic thing - do MAF with ancestral and derived, then make a histogram with derived allele. Should show same decay.
20200901
Directory here:
ftp://ftp.ensembl.org/pub/release-100/emf/ensembl-compara/multiple_alignments/47_fish.epo/
Contains two READMEs: 1. README.emf - 2. README.47_fish.epo - EPO = Enredo-Pecan-Ortheus multiple alignment. - Enredo builds a set of co-linear regions between the genomes; Pecan aligns these whole sets of sequences; Ortheus uses the Pecan alginments to infer the ancestral sequences. - Alignments are grouped by HdrR chromosomes. The fiels names .other.emf contain alignments that do not inlcude any HdrR region. - Contains the species tree. HdrR and HNI are grouped together, then HSOK, then javanicus and melastigma (0.0375962 and 0.0447938 respectively; 0.016843 as a group).
To get the tree, copy the TREE section from the .EMF files, and paste into <phylo.io>.
Pic of the tree saved here: ~/Documents/Docs/medaka pics/20200602_mikk_genome/20200901_phylo_tree.png
Ancestor connecting HdrR and HNI: Aseq_Ancestor_1828_105991 Ancestor connecting those two and HSOK: Aseq_Ancestor_1828_105989 Ancestor connecting javanicus and melastigma: Aseq_Ancestor_1828_105995 Ancestor connecting those two and latipes: Aseq_Ancestor_1828_105986
So we want the following columns:
1: HdrR 2: Aseq_Ancestor_1828_105991 3: HNI 4: Aseq_Ancestor_1828_105989 5: HSOK 6: Aseq_Ancestor_1828_105986 7: javanicus 8: Aseq_Ancestor_1828_105995 9: melastigma
Meeting with Ewan: Use Aseq_Ancestor_1828_105986 as the outgroup.
mkdir emfs
# download
wget -P emfs/ ftp://ftp.ensembl.org/pub/release-100/emf/ensembl-compara/multiple_alignments/47_fish.epo/*
# unzip into new directory
mkdir emfs/unzipped
for i in $(find emfs/47_fish.epo.[0-9]*); do
name=$(basename $i | cut -f3,4 -d'.');
bsub "zcat $i > emfs/unzipped/$name";
done
# NOTE that file 17_1 is in a completely different format, with CIGAR strings instead of the normal SEQ, TREE, ID and DATA segments.
# want to remove all lines starting with $VAR, a space (make it two to be safe), and all consecutive blank lines (leaving single blank lines)
# remove all lines starting with $VAR or two spaces, and all consecutive blank lines
sed '/\$VAR1/d; /\ \ /d' emfs/unzipped/17_1.emf | uniq -u > emfs/unzipped/tmp.txt
# replace old file
mv emfs/unzipped/tmp.txt emfs/unzipped/17_1.emf
# find line numbers where DATA starts (actual data starts line after)
grep -n "^DATA" emfs/unzipped/10_1.emf | cut -f1 -d":"
# find line numbers where DATA ends (actual data ends line before)
grep -n "//" emfs/unzipped/10_1.emf | cut -f1 -d":"
# find line numbers of HdrR sequence start and end
grep -n "^SEQ oryzias_latipes 10" emfs/unzipped/10_1.emf
# find first instance
grep -n -m 1 "^DATA" emfs/unzipped/10_1.emf
# find blank lines
grep -n ^$ emfs/unzipped/10_1.emf
# prints 200... 1 for each segment
# get line number of specific blank line
grep -m 2 -n ^$ emfs/unzipped/10_1.emf | tail -n1 | cut -f1 -d":"
# print within range
sed -n '8,962579p' emfs/unzipped/10_1.emf | grep "^SEQ"
mkdir emfs/segmented
for i in $(find emfs/unzipped/10_1.emf); do
# get basename
bname=$(basename $i);
bname_short=$(echo ${bname::-4});
# get chromosome
chr=$(echo $bname | cut -f1 -d"_");
# make directory for each EMF file
mkdir emfs/segmented/$bname_short;
# get segment count
segment_count=$(grep "^DATA" $i | wc -l);
# get segment start and end for each file
for j in $(seq 1 $segment_count | head -1); do
target_line=$(grep -m $j "^SEQ oryzias_latipes $chr" $i | tail -n1 );
segment_start=$(echo $target_line | cut -f4 -d" " );
segment_end=$(echo $target_line | cut -f5 -d" " );
strand=$(echo $target_line | cut -f6 -d" " );
echo -e "$chr\t$segment_start\t$segment_end\t$strand";
file_name_data=$(echo $chr\_$segment_start\_$segment_end\_$strand.data.txt);
# pull out data
chunk_start=$(grep -m $j -n "^DATA" $i | tail -n1 | cut -f1 -d":");
chunk_end=$(grep -m $j -n "//" $i | tail -n1 | cut -f1 -d":");
chunk_line_count=$(expr $(echo $chunk_end) - $(echo $chunk_start) - 1 );
awk -F"" "NR==$(expr $(echo $chunk_start) + 1),NR==$(expr $(echo $chunk_end) - 1)" $i | head \
> emfs/segmented/$bname_short/$file_name_data;
# pull out metadata
file_name_meta=$(echo $chr\_$segment_start\_$segment_end\_$strand.meta.txt);
blank_line_start=$(grep -m $j -n ^$ $i | tail -n1 | cut -f1 -d":");
counter_plus_1=$(expr $(echo $j) + 1);
blank_line_end=$(grep -m $counter_plus_1 -n ^$ $i | tail -n1 | cut -f1 -d":");
awk "NR==$blank_line_start,NR==$blank_line_end" $i | grep "^SEQ\|^TREE" \
> emfs/segmented/$bname_short/$file_name_meta;
done;
done
# Use script to parallelise each segment
for i in $(find emfs/unzipped/* | awk 'NR==41,NR==59' ); do
# get basename
bname=$(basename $i);
bname_short=$(echo ${bname::-4} );
# get chromosome
chr=$(echo $bname | cut -f1 -d"_" );
# make directory for each EMF file
mkdir emfs/segmented/$bname_short;
# get segment count
segment_count=$(grep "^DATA" $i | wc -l );
# get segment start and end for each file
for j in $(seq 1 $segment_count ); do
bsub \
-o log/$bname_short\_$j.out \
-e log/$bname_short\_$j.out \
"./mikk_genome/code/scripts/20200907_extract-emf-segments.sh $i $j $bname_short"
done;
done
# 20200922 amending due to misalignment for the 10_2.emf file
# e.g. head -1 emfs/segmented/10_2/*.seq.txt gives
# ==> emfs/segmented/10_2/10_14744774_14768698_1.seq.txt <==
# SEQ oryzias_latipes 10 20304438 20354944 1 (chr_length=31218526)
# grep "^SEQ oryzias_latipes " $i | grep -n "20304438 20354944" gives
# 114
for i in $(find emfs/unzipped/* ); do
# get basename
bname=$(basename $i);
bname_short=$(echo ${bname::-4} );
# get chromosome
chr=$(echo $bname | cut -f1 -d"_" );
# make directory for each EMF file
new_path=$(echo emfs/segmented/$bname_short );
if [ ! -d "$new_path" ]; then
mkdir $new_path;
fi
# get segment count
segment_count=$(grep "^DATA" $i | wc -l );
# get segment start and end for each file
for j in $(seq 1 $segment_count ); do
bsub \
-o log/20200922_$bname_short\_$j.out \
-e log/20200922_$bname_short\_$j.out \
"./mikk_genome/code/scripts/20200907_extract-emf-segments.sh $i $j $new_path "
done;
done
find emfs/segmented/*/*.data.txt | wc -l
# Creates 9338 files
find emfs/segmented/*/*_1.data.txt | wc -l
# 4552 forward strand
# 4786 reverse strand
for i in $(find emfs/unzipped/10_2* ); do
# get basename
bname=$(basename $i);
bname_short=$(echo ${bname::-4} );
# get chromosome
chr=$(echo $bname | cut -f1 -d"_" );
# make directory for each EMF file
new_path=$(echo emfs/segmented/$bname_short );
if [ ! -d "$new_path" ]; then
mkdir $new_path;
fi
# get segment count
segment_count=$(grep "^DATA" $i | wc -l );
# get segment start and end for each file
for j in $(seq 1 $segment_count ); do
bsub \
-o log/20200922_$bname_short\_$j.out \
-e log/20200922_$bname_short\_$j.out \
"./mikk_genome/code/scripts/20200907_extract-emf-segments.sh $i $j $new_path "
done;
done
Find out whether it lines up with the HdrR FASTA
# Pull out first column of data
awk "NR==$(expr $(echo $chunk_start) + 1),NR==$(expr $(echo $chunk_end) - 1)" $i | head | awk -F "" -v OFS="\t" '{gsub("","\t"); print}'
awk "NR==$(expr $(echo $chunk_start) + 1),NR==$(expr $(echo $chunk_end) - 1)" $i | head | awk '$1=$1' FS="" OFS="\t"
# WORKS
# Bring over with rmate and save on Desktop
# Total in segment
10267882 - 10089051
# 178831
# Read in
test <- scan(file = "~/Desktop/tmp.txt",
what = character())
# How many aren't "-"
length(which(!test == "-"))
#178832
# Perfect!
grep "^SEQ" emfs/segmented/10_1/10_10089051_10267882_1.meta.txt
# First see whether the sequences are the same as the FASTA
# pull out sequence
samtools faidx refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa 10:10089051-10267882
# 10267882 - 10089051
178831
# count number of chracters
samtools faidx refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa 10:10089051-10267882 | tail -n+2 | tr -d \\n | wc -c
# 178832
# bang on!
# paste to file
samtools faidx refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa 10:10089051-10267882 | tail -n+2 | tr -d \\n > tmp1.txt
ed -s tmp1.txt <<< w # add new line to end
# paste first column of data file to another file
cut -f1 emfs/segmented/10_1/10_10089051_10267882_1.data.txt | tr -d \\n | sed 's/-//g' | tr [a-z] [A-Z] > tmp2.txt
ed -s tmp2.txt <<< w
# compare
cmp tmp1.txt tmp2.txt
## SUCCESS!
# try with reverse strand segments
emfs/segmented/10_1/10_10031232_10089050_-1.data.txt
# paste to file - note `-i` flag for reverse complement
samtools faidx -i refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa 10:10031232-10089050 | tail -n+2 | tr -d \\n > tmp1.txt
ed -s tmp1.txt <<< w # add new line to end
# paste first column of data file to another file
cut -f1 emfs/segmented/10_1/10_10031232_10089050_-1.data.txt | tr -d \\n | sed 's/-//g' | tr [a-z] [A-Z] > tmp2.txt
ed -s tmp2.txt <<< w
# compare
cmp tmp1.txt tmp2.txt
# SUCCESS!
# get names of species
grep "^SEQ" emfs/segmented/10_1/10_10089051_10267882_1.meta.txt | awk 'OFS="_" {print $2,$3}'
library(geiger) # also loads `ape`
phylo_tree <- read.tree(file = "~/Desktop/test_tree.txt")
# plot
ape::plot.phylo(phylo_tree, show.node.label = T)
# get vector of IDs
ids <- phylo_tree$tip.label[grep("Olat|Ohni|Ohso|Omel|Ojav", phylo_tree$tip.label)]
# get node number of most recent common ancestor
node_number <- ape::getMRCA(phylo_tree, tip = ids)
# get label of node
mrca_label <- c(phylo_tree$tip.label, phylo_tree$node.label)[node_number]
mkdir emfs/cleaned
# Find which ones have multiple columns for HdrR
grep "Hit count: 2" log/20200922_10_2*
#log/20200922_10_2_43.out:Hit count: 2
#log/20200922_10_2_64.out:Hit count: 2
file_in <- "emfs/segmented/10_2/10_16101474_16156360_1"
# TEST
Rscript --vanilla mikk_genome/code/scripts/20200908_add-hdrr-coords-to-emf-data.R emfs/segmented/10_1/10_10031232_10089050_-1
# Note this only takes the sequences on the forward strand - still gets 4552 / 9338 segments
for i in $(find emfs/segmented/*/*_1.data.txt ); do
# make new directory
chr_file=$(echo $i | cut -f3 -d"/" );
new_path=$(echo emfs/cleaned/$chr_file );
if [ ! -d "$new_path" ]; then
mkdir $new_path;
fi
# create name for new file
bname=$(echo $i | sed 's/.data.txt//g' );
bname_short=$(basename $bname );
bsub \
-M 10000 \
-o log/20200925_$chr_file\_$bname_short.out \
-e log/20200925_$chr_file\_$bname_short.err \
"Rscript --vanilla mikk_genome/code/scripts/20200908_add-hdrr-coords-to-emf-data.R $bname $new_path";
done
# 114 exited with error codes (4438 successful):
grep "Execution halted" log/20200925*err
# All of those errors are from file 17_1, which are caused by differences in the length of the purported segment (based on the SEQ data), and the number of rows in the data frame once the '-' entries are filtered out, e.g.:
# file_in <- "emfs/segmented/17_1/17_8966018_9533915_1").
# Not much we can do about that. Move on.
Proceed with those for the time being…
Pull two files
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/emfs/cleaned/1_1/1_10018060_10055052_1.txt ~/Documents/Data/20200910_abbababa_test
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/emfs/cleaned/1_1/1_1004861_1059491_1.txt ~/Documents/Data/20200910_abbababa_test
Do operation on R
files <- list.files("~/Documents/Data/20200910_abbababa_test", full.names = T)
# read data files into list and process
dat_list <- lapply(files, function(x){
# read in data
df <- read.delim(x, header = T)
# change `chr` and `coord` columns into integers
df$chr <- as.integer(df$chr)
df$coord <- as.integer(df$coord)
# change name of ancestor in header
colnames(df)[grep("Ancestor", colnames(df))] <- "ancestor"
# capitalise all letters
df <- df %>%
dplyr::mutate(across(.cols = c(-chr, -coord),.fns = toupper))
return(df)
})
# tidy data
dat_df <- dplyr::bind_rows(dat_list) %>% # bind into single DF
dplyr::arrange(coord) %>% # sort by coordinate
tidyr::unite("chr_pos", chr, coord, sep = ":", remove = F)# create new column
dplyr::select(chr_pos, # order columns
chr,
coord,
oryzias_latipes,
oryzias_latipes_hni,
oryzias_latipes_hsok,
ancestor,
oryzias_javanicus,
oryzias_melastigma)
Pull three files
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/emfs/consolidated/10_1.txt ~/Documents/Data/20200910_abbababa_test
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/emfs/consolidated/10_2.txt ~/Documents/Data/20200910_abbababa_test
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/emfs/consolidated/10_3.txt ~/Documents/Data/20200910_abbababa_test
dir_in <- "~/Documents/Data/20200910_abbababa_test"
chr <- "10"
chr_prefix <- paste(chr, "_", sep = "")
files <- list.files(dir_in, pattern = chr_prefix, full.names = T)
dat_list <- lapply(files, function(x){
# read in data
df <- read.delim(x, header = T)
return(df)
})
names(dat_list) <- stringr::str_split(basename(files), pattern = "\\.", simplify = T)[, 1]
dat_df <- dplyr::bind_rows(dat_list, .id = "chr_file") %>% # bind into single DF
dplyr::arrange(coord) %>% # sort by coordinate
dplyr::select(chr_pos, # order columns
chr,
coord,
oryzias_latipes,
oryzias_latipes_hni,
oryzias_latipes_hsok,
ancestor,
oryzias_javanicus,
oryzias_melastigma,
chr_file) %>%
dplyr::mutate(across(everything(), ~na_if(.x, "-")))
mkdir emfs/consolidated
# run for each sub-chr (e.g. 10_1, 10_2)
for i in $(find emfs/cleaned/* -type d ); do
# get chromosome
chr=$(echo $i | cut -f3 -d"/" );
# consolidate into one file
bsub \
-M 20000 \
-o log/20200925_consol_$chr.out \
-e log/20200925_consol_$chr.err \
"Rscript --vanilla mikk_genome/code/scripts/20200910_consolidate_chr_dat.R $i emfs/consolidated/$chr.txt ";
done
# combine for each chr
mkdir emfs/final
for i in $(find emfs/consolidated/* | cut -f3 -d"/" | cut -f1 -d"_" | sort | uniq ); do
bsub \
-M 20000 \
-o log/20200925_full_chr_$i.out \
-e log/20200925_full_chr_$i.err \
"Rscript --vanilla mikk_genome/code/scripts/20200911_consolidate_chr_files.R $i emfs/consolidated emfs/final ";
done
# Get allele frequency of biallelic SNPs only
bcftools view \
--min-alleles 2 \
--max-alleles 2\
--types snps \
--output-type u \
vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz | \
bcftools +fill-tags \
--output-type z \
--output vcfs/panel_no-sibs_line-ids_no-missing_bi-snps_with-af.vcf.gz \
-- \
--tags AF
# Try with VCFtools
vcftools \
--gzvcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--freq \
--remove-indels \
--min-alleles 2 \
--max-alleles 2 \
--out maf/20200910_biallelic-snps_af
# Output has 16,035,052 hits with this format:
#CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}
#1 15430 2 126 G:0.992063 A:0.00793651
# Try with BCFtools
bcftools query \
--format '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AF\n' \
--output maf/20200910_af.txt \
vcfs/panel_no-sibs_line-ids_no-missing_bi-snps_with-af.vcf.gz
# Output has the same amount of hits - 16,035,052 - with this format:
#1 15430 G A 0.00793651 where the freq is of the alt allele
# Split by chromosome and paste together chr:pos columns
mkdir maf/20200910_split-by-chr
for i in $(cut -f1 maf/20200910_af.txt | sort | uniq); do
awk "\$1 == $i" maf/20200910_af.txt | \
awk -v OFS="\t" '{print $1":"$2, $0}' \
> maf/20200910_split-by-chr/$i.txt ;
done
# Pull to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/emfs/final/23.txt ~/Documents/Data/20200910_abbababa_test/23_final.txt
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/maf/20200910_split-by-chr/23.txt ~/Documents/Data/20200910_abbababa_test/23_af.txt
# read in data
final_dat <- read.delim("~/Documents/Data/20200910_abbababa_test/23_final.txt", header = T)
af_dat <- readr::read_tsv("~/Documents/Data/20200910_abbababa_test/23_af.txt",
col_names = c("chr_pos", "chr", "pos", "ref", "alt", "mikk"),
col_types = "ciiccd")
# rename columns
colnames(final_dat)[colnames(final_dat) == "oryzias_latipes"] <- "hdrr" # HdrR
colnames(final_dat) <- gsub("oryzias_latipes_", "", colnames(final_dat)) # HNI and HSOK
colnames(final_dat) <- gsub("oryzias_", "", colnames(final_dat)) # other medaka
# join DFs
joined_dat <- dplyr::inner_join(af_dat,
dplyr::select(final_dat,
!c(chr, coord, chr_file)),
by = "chr_pos") %>%
tidyr::drop_na(ancestor)
# set ancestral and derived alleles and get
final_dat <- joined_dat %>%
dplyr::mutate(ancestral = ancestor,
derived = dplyr::if_else(ancestor == ref,
alt,
ref),
mikk = dplyr::if_else(ancestral == ref,
mikk,
1 - mikk ),
mutate(across(hdrr:melastigma,
~dplyr::if_else(.x == derived,
1,
0)))) %>%
dplyr::select(chr, pos, ancestral, derived,
mikk, hdrr, hni, hsok, javanicus, melastigma)
# !!var_names[1] := dplyr::if_else(.data[[pop_1]] == der_al,
# 1,
# 0),
# !!var_names[2] := dplyr::if_else(.data[[pop_2]] == der_al,
# 1,
# 0))
# get target columns for final dat
pop_1 <- "hdrr"
pop_2 <- "hni"
pop_3 <- "ancestor"
pops <- c(pop_1, pop_2, pop_3)
target_cols <- c("chr_pos", pops)
#%>%
# tidyr::drop_na() # take only complete cases
# set ancestral and derived allele
var_names <- paste(pops, "frq", sep = "_")
test <- joined_dat %>%
dplyr::mutate(anc_al = ancestor,
der_al = dplyr::if_else(ancestor == ref,
alt,
ref),
mikk_frq = dplyr::if_else(ancestor == ref,
mikk,
1 - mikk ),
!!var_names[1] := dplyr::if_else(.data[[pop_1]] == der_al,
1,
0),
!!var_names[2] := dplyr::if_else(.data[[pop_2]] == der_al,
1,
0)) %>%
dplyr::select(chr,
pos,
anc_al,
der_al,
mikk = mikk_frq,
!!pops[1] := !!var_names[1],
!!pops[2] := !!var_names[2])
mkdir abba_baba_mikk
mkdir abba_baba_mikk/freq_tables
for i in $( seq 1 24 ); do
dat_file=$( echo emfs/final/$i.txt );
af_file=$( echo maf/20200910_split-by-chr/$i.txt );
bsub \
-M 30000 \
-o log/20200925_frq_tbl_$i.out \
-e log/20200925_frq_tbl_$i.err \
"Rscript --vanilla mikk_genome/code/scripts/20200911_combine_final_and_af_data.R $i $dat_file $af_file abba_baba_mikk/freq_tables ";
done
Run directly on cluster
files <- list.files("abba_baba_mikk/freq_tables",
pattern = "[0-9]",
full.names = T)
# read in data
dat_list <- lapply(files, function(x){
# read in data
df <- readr::read_tsv(x,
col_types = "iiccdiiiii")
return(df)
})
# bind together and sort
final_df <- dplyr::bind_rows(dat_list) %>% # bind into single DF
dplyr::arrange(chr, pos) # sort by chromosome, then position
# write table
write.table(final_df, "abba_baba_mikk/freq_tables/all.txt", quote = F, sep = "\t", row.names = F)
Send to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/abba_baba_mikk/freq_tables/all.txt ~/Documents/Data/20200910_abbababa_test/
Following the tutorial here: https://github.com/simonhmartin/tutorials/tree/master/ABBA_BABA_whole_genome
Download files to local:
mkcd abbababa_tutorial
mkcd abbababa_tutorial/data
wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_whole_genome/data/hel92.DP8MP4BIMAC2HET75dist250.geno.gz
wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_whole_genome/data/hel92.pop.txt
cd ..
wget https://github.com/simonhmartin/genomics_general/archive/master.zip
unzip master.zip
# Read in data
freq_table <- read.table("~/Documents/Data/20200910_abbababa_test/all.txt",
header = T,
as.is = T)
# Create D.stat function
D.stat <- function(p1, p2, p3) {
ABBA <- (1 - p1) * p2 * p3
BABA <- p1 * (1 - p2) * p3
(sum(ABBA) - sum(BABA)) / (sum(ABBA) + sum(BABA))
}
# Define populations
# We're asking about whether there is introgression between MIKK and HdrR, so P1 must be HNI, the population that is allopatric with MIKK.
P1 <- "hni"
P2 <- "hdrr"
P3 <- "mikk"
pops <- c(P1, P2, P3)
# Select only those populations and remove NAs
freq_table_filt <- freq_table %>%
dplyr::select(chr, pos, ancestral, derived, all_of(pops)) %>%
tidyr::drop_na()
D <- D.stat(freq_table_filt[,P1], freq_table_filt[,P2], freq_table_filt[,P3])
print(paste("D =", round(D,4)))
# D = 0.7527
The tutorial only gets a D score of 0.3821! This indicates an excess of ABBA over BABA, and therefore that MIKK shares more genetic variation with HdrR than with HNI.
But we don’t know whether this result is statistically robust, because if it results from odd ancestry at just one part of the genome, we would have less confidence that there has been significant introgression.
To test for a consistent genome-wide signal, we use a block-jackknife procedure.
This allows us to compute the variance of D despite non-independence among sites.
# get R functions
source("~/Documents/Repositories/abbababa_tutorial/genomics_general-master/jackknife.R")
# Define blocks
block_indices <- get_block_indices(block_size=1e6,
positions=freq_table_filt$pos,
chromosomes=freq_table_filt$chr)
# Have a play with code
## Set variables
block_size <- 1e6
positions <- freq_table_filt$pos
chromosomes <- freq_table_filt$chr
## Get chromosome names
chrom_names <- unique(chromosomes)
block_starts <- lapply(chrom_names, function(chrom_name) {
seq(min(positions[chromosomes==chrom_name]),
max(positions[chromosomes==chrom_name]),
block_size)
})
block_chroms <- unlist(lapply(1:length(block_starts), function(x){
rep(chrom_names[x],
length(block_starts[[x]]))
})
)
block_starts <- unlist(block_starts)
block_ends <- block_starts + block_size - 1
block_indices <- lapply(1:length(block_starts), function(x){
which(chromosomes == block_chroms[x] &
positions >= block_starts[x] &
positions <= block_ends[x])
})
# remove blank entries (added)
block_indices <- block_indices[lapply(block_indices, length) > 0 ]
n_blocks <- length(block_indices)
print(paste("Genome divided into", n_blocks, "blocks."))
Compute the standard deviation of D.
D_sd <- get_jackknife_sd(block_indices=block_indices,
FUN=D.stat,
freq_table_filt[,P1], freq_table_filt[,P2], freq_table_filt[,P3])
print(paste("D standard deviation = ", round(D_sd,4)))
# 0.161 for the tutorial
D_err <- D_sd/sqrt(n_blocks)
D_Z <- D / D_err
print(paste("D Z score = ", round(D_Z,3)))
# 40.003 for the tutorial
The massive Z score in this case means the deviation from 0 is hugely significant.
Now we want to quantify the proportion of the genome that has been shared, denoted by the f statistic.
The idea behind this approach is to compare the observed excess of ABBA over BABA, to that which would be expected under complete admixture.
Now we need a second population of the P3 species in place of P2.
bcftools query -l vcfs/panel_no-sibs_line-ids_no-missing_bi-snps_with-af.vcf.gz > mikk_genome/data/20200914_samples_no-sibs.txt
# randomly assign population to group 1 or 2
samples <- scan("mikk_genome/data/20200914_samples_no-sibs.txt", what = "chr")
set.seed(54)
pop_1 <- sample(samples, 32)
pop_2 <- samples[!samples %in% pop_1]
write(pop_1, "mikk_genome/data/20200914_abbababa_mikk-1.txt")
write(pop_2, "mikk_genome/data/20200914_abbababa_mikk-2.txt")
# split samples into two VCFs
for i in $(seq 1 2 ); do
bcftools view \
--min-alleles 2 \
--max-alleles 2\
--types snps \
--samples-file mikk_genome/data/20200914_abbababa_mikk-$i.txt \
--output-type u \
vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz | \
bcftools +fill-tags \
--output-type z \
--output vcfs/panel_no-sibs_line-ids_no-missing_bi-snps_with-af_$i.vcf.gz \
-- \
--tags AF;
done
# get allele freqs for each subpop
for i in $(seq 1 2 ); do
bcftools query \
--format '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AF\n' \
--output-file maf/20200915_af_$i.txt \
vcfs/panel_no-sibs_line-ids_no-missing_bi-snps_with-af_$i.vcf.gz;
done
# Split by chromosome and paste together chr:pos columns
for i in $(seq 1 2 ); do
mkdir maf/20200915_split-by-chr_$i;
done
for i in $(seq 1 2 ); do
for j in $(cut -f1 maf/20200915_af_$i.txt | sort | uniq); do
awk "\$1 == $j" maf/20200915_af_$i.txt | \
awk -v OFS="\t" '{print $1":"$2, $0}' \
> maf/20200915_split-by-chr_$i/$j.txt ;
done;
done
# Combine AF and EMF files
mkdir abba_baba_mikk/freq_tables_mikk_1
mkdir abba_baba_mikk/freq_tables_mikk_2
for i in $( seq 1 2 ); do
for j in $( seq 1 24 ); do
dat_file=$( echo emfs/final/$j.txt );
af_file=$( echo maf/20200915_split-by-chr_$i/$j.txt );
bsub \
-M 30000 \
-o log/20200915_frq_tbl_$i\_$j.out \
-e log/20200915_frq_tbl_$i\_$j.err \
"Rscript --vanilla mikk_genome/code/scripts/20200911_combine_final_and_af_data.R $j $dat_file $af_file abba_baba_mikk/freq_tables_mikk_$i ";
done;
done
Combine chr files into single file directly on cluster
group_a <- list.files("abba_baba_mikk/freq_tables_mikk_1",
pattern = "[0-9]",
full.names = T)
group_b <- list.files("abba_baba_mikk/freq_tables_mikk_2",
pattern = "[0-9]",
full.names = T)
# read in data
dat_list_a <- lapply(group_a, function(x){
# read in data
df <- readr::read_tsv(x,
col_types = "iiccdiiiii") %>%
dplyr::rename(mikk_a = mikk)
return(df)
})
dat_list_b <- lapply(group_b, function(x){
# read in data
df <- readr::read_tsv(x,
col_types = "iiccdiiiii") %>%
dplyr::rename(mikk_b = mikk)
return(df)
})
# bind together and sort
final_df_a <- dplyr::bind_rows(dat_list_a) %>% # bind into single DF
dplyr::arrange(chr, pos) # sort by chromosome, then position
final_df_b <- dplyr::bind_rows(dat_list_b) %>% # bind into single DF
dplyr::arrange(chr, pos) # sort by chromosome, then position
# read in previous final data
final_df_prev <- readr::read_tsv("abba_baba_mikk/freq_tables/all.txt",
col_types = "iiccdiiiii")
# join
final_df <- final_df_prev %>%
dplyr::full_join(dplyr::select(final_df_a, chr, pos, mikk_a),
by = c("chr", "pos")) %>%
dplyr::full_join(dplyr::select(final_df_b, chr, pos, mikk_b),
by = c("chr", "pos")) %>%
dplyr::select(chr, pos, ancestral, derived, mikk, mikk_a, mikk_b, everything())
# write table
write.table(final_df, "abba_baba_mikk/freq_tables/all_with-split-mikk.txt", quote = F, sep = "\t", row.names = F)
Send to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/abba_baba_mikk/freq_tables/all_with-split-mikk.txt ~/Documents/Data/20200910_abbababa_test
Read in again
freq_table <- read.table("~/Documents/Data/20200910_abbababa_test/all_with-split-mikk.txt",
header = T,
as.is = T)
# Create f-stat function
f.stat <- function(p1, p2, p3a, p3b) {
ABBA_numerator <- (1 - p1) * p2 * p3a
BABA_numerator <- p1 * (1 - p2) * p3a
ABBA_denominator <- (1 - p1) * p3b * p3a
BABA_denominator <- p1 * (1 - p3b) * p3a
(sum(ABBA_numerator) - sum(BABA_numerator)) /
(sum(ABBA_denominator) - sum(BABA_denominator))
}
# Choose our P3a and P3b populations
## Note the changes in P1 and P2
P1 <- "hni"
P2 <- "hdrr"
P3 <- "mikk"
P3a <- "mikk_a"
P3b <- "mikk_b"
# Remove NAs again
pops <- c(P1, P2, P3, P3a, P3b)
# Select only those populations and remove NAs
freq_table_filt <- freq_table %>%
dplyr::select(chr, pos, ancestral, derived, all_of(pops)) %>%
tidyr::drop_na()
f <- f.stat(freq_table_filt[,P1], freq_table_filt[,P2], freq_table_filt[,P3a], freq_table_filt[,P3b])
print(paste("Admixture proportion = ", round(f,4)))
# 0.2743 for the tutorial
# "Admixture proportion = 0.6281"
This reveals that 62.81% of the genome is shared between HdrR and MIKK. This can be interpreted as the average proportion of shared ancestry with HdrR in any single MIKK genome.
We can again use the block jackknife to estimate the standard deviation of f, and obtain a confidence interval.
f_sd <- get_jackknife_sd(block_indices=block_indices,
FUN=f.stat,
freq_table_filt[,P1],
freq_table_filt[,P2],
freq_table_filt[,P3a],
freq_table_filt[,P3b])
f_err <- f_sd/sqrt(n_blocks)
f_CI_lower <- f - 1.96*f_err
f_CI_upper <- f + 1.96*f_err
print(paste("95% confidence interval of f =", round(f_CI_lower,4), round(f_CI_upper,4)))
Do all chromosomes show evidence of introgression?
# Get row indices for each chromosome
chrom_names <- unique(freq_table_filt$chr)
chrom_indices <- lapply(chrom_names, function(chrom) which(freq_table_filt$chr == chrom))
names(chrom_indices) <- chrom_names
# Get number of SNPs in each chromosome
sapply(chrom_indices, length)
# Use these indices to compute a D value for each chromosome
D_by_chrom <- sapply(chrom_names, function(chrom) {
D.stat(freq_table_filt[chrom_indices[[chrom]], P1],
freq_table_filt[chrom_indices[[chrom]], P2],
freq_table_filt[chrom_indices[[chrom]], P3])
})
# Jackknife
block_indices_by_chrom <- lapply(chrom_names, function(chrom){
block_indices <- get_block_indices(block_size=1e6,
positions=freq_table_filt$pos[freq_table_filt$chr==chrom])
# remove empty blocks
block_indices <- block_indices[lapply(block_indices, length) > 0 ]
return(block_indices)
})
names(block_indices_by_chrom) <- chrom_names
# Check number of blocks per chr
sapply(block_indices_by_chrom, length)
lapply(block_indices_by_chrom, sapply, length)
# Compute Z scores for each chromosome
D_sd_by_chrom <- sapply(chrom_names, function(chrom){
get_jackknife_sd(block_indices=block_indices_by_chrom[[chrom]],
FUN=D.stat,
freq_table_filt[chrom_indices[[chrom]], P1],
freq_table_filt[chrom_indices[[chrom]], P2],
freq_table_filt[chrom_indices[[chrom]], P3])
})
D_err_by_chrom <- D_sd_by_chrom / sqrt(sapply(block_indices_by_chrom, length))
D_Z_by_chrom <- D_by_chrom / D_err_by_chrom
D_Z_by_chrom
Lowest Z score is 4.3 (chr 20).
Instead of using the allopatric mel_mel as P1, we can use the physically closer and more closely related mel_vul. Do we still see a significant D value and large admixture proportion? If not, why?
P1 <- "mel_mel"
P2 <- "mel_ros"
P3 <- "cyd_chi"
# D = 0.3821
# Genome divided into 284 blocks.
# D standard deviation = 0.161
# D Z score = 40.003
# Admixture proportion = 0.2743
# 95% confidence interval of f = 0.2551 0.2936
P1 <- "mel_vul"
P2 <- "mel_ros"
P3 <- "cyd_chi"
# D = -0.0307
# Genome divided into 284 blocks.
# D standard deviation = 0.0805
# D Z score = -6.42
# Admixture proportion = -0.0259
# 95% confidence interval of f = -0.0337 -0.0182
We still see a significant D value, but it’s negative! This suggests that there is more BABA than ABBA, and therefore P3 (cyd_chi) is more closely related to mel_vul than it is to mel_ros?
Does this suggest that we need to find use the most allopatric species as P1? In our case that would be melastigma (or javanicus).
Make function to run repeatedly.
source("~/Documents/Repositories/mikk_genome/code/scripts/20200916_abbababa_functions.R")
# read in data
freq_table <- read.table("~/Documents/Data/20200910_abbababa_test/all_with-split-mikk.txt",
header = T,
as.is = T)
run_abbababa <- function(data, P1, P2){
# set popns and remove NAs
P3 <- "mikk"
P3a <- "mikk_a"
P3b <- "mikk_b"
pops <- c(P1, P2, P3, P3a, P3b)
# Select only those populations and remove NAs
freq_table <- data %>%
dplyr::select(chr, pos, ancestral, derived, all_of(pops)) %>%
tidyr::drop_na()
# Create output list
out_list <- list()
# Add populations
out_list[["Populations"]] <- list("P1" = P1,
"P2" = P2,
"P3" = "mikk")
# Get genome-wide D stat
out_list[["Genome-wide D"]] <- list()
D <- D.stat(freq_table[,P1], freq_table[,P2], freq_table[,P3])
out_list[["Genome-wide D"]][["D statistic"]] <- D
block_indices <- get_block_indices(block_size=1e6, # Block jackknife to obtain SD and Z-score
positions=freq_table$pos,
chromosomes=freq_table$chr)
block_indices <- block_indices[lapply(block_indices, length) > 0 ] # remove empty entries
n_blocks <- length(block_indices)
out_list[["Genome-wide D"]][["Number of blocks"]] <- n_blocks
D_sd <- get_jackknife_sd(block_indices=block_indices, # get D SD
FUN=D.stat,
freq_table[,P1], freq_table[,P2], freq_table[,P3])
out_list[["Genome-wide D"]][["Standard deviation"]] <- D_sd
D_err <- D_sd/sqrt(n_blocks)
D_Z <- D / D_err
out_list[["Genome-wide D"]][["Z score"]] <- D_Z
# Get admixture proportion
out_list[["Admixture"]] <- list()
f <- f.stat(freq_table[,P1], freq_table[,P2], freq_table[,P3a], freq_table[,P3b])
out_list[["Admixture"]][["f statistic"]] <- f
f_sd <- get_jackknife_sd(block_indices=block_indices,
FUN=f.stat,
freq_table[,P1], freq_table[,P2], freq_table[,P3a], freq_table[,P3b])
out_list[["Admixture"]][["Standard deviation"]] <- f_sd
f_err <- f_sd/sqrt(n_blocks)
f_CI_lower <- f - 1.96*f_err
f_CI_upper <- f + 1.96*f_err
out_list[["Admixture"]][["Confidence interval"]] <- list("lower" = f_CI_lower,
"upper" = f_CI_upper)
# Get per-chromosome D
chrom_names <- unique(freq_table$chr)
chrom_indices <- lapply(chrom_names, function(chrom) which(freq_table$chr == chrom))
names(chrom_indices) <- chrom_names
out_list[["Per-chromosome D"]] <- lapply(chrom_names, function(chrom){
per_chr_out <- list()
# get D stat
D_by_chrom <- D.stat(freq_table[chrom_indices[[chrom]], P1],
freq_table[chrom_indices[[chrom]], P2],
freq_table[chrom_indices[[chrom]], P3])
per_chr_out[["D statistic"]] <- D_by_chrom
# number of SNPs
per_chr_out[["Number of SNPs"]] <- length(chrom_indices[[chrom]])
# get block indices
block_indices_by_chrom <- get_block_indices(block_size=1e6,
positions=freq_table$pos[freq_table$chr == chrom])
# remove empty blocks
block_indices_by_chrom <- block_indices_by_chrom[lapply(block_indices_by_chrom, length) > 0 ]
# Get SD
D_sd_by_chrom <- get_jackknife_sd(block_indices=block_indices_by_chrom,
FUN=D.stat,
freq_table[chrom_indices[[chrom]], P1],
freq_table[chrom_indices[[chrom]], P2],
freq_table[chrom_indices[[chrom]], P3])
per_chr_out[["Standard error"]] <- D_sd_by_chrom
# Get Z score
D_err_by_chrom <- D_sd_by_chrom / sqrt(length(block_indices_by_chrom))
D_Z_by_chrom <- D_by_chrom / D_err_by_chrom
per_chr_out[["Z score"]] <- D_Z_by_chrom
# Return list
return(per_chr_out)
})
names(out_list[["Per-chromosome D"]]) <- chrom_names
# return output
return(out_list)
}
test_function <- function(data, P1, P2){
# set popns and remove NAs
P3 <- "mikk"
P3a <- "mikk_a"
P3b <- "mikk_b"
pops <- c(P1, P2, P3, P3a, P3b)
# Select only those populations and remove NAs
freq_table <- data %>%
dplyr::select(chr, pos, ancestral, derived, all_of(pops)) %>%
tidyr::drop_na()
out_list <- list()
# Get per-chromosome D
#out_list[["Per-chromosome D"]] <- list()
chrom_names <- unique(freq_table$chr)
chrom_indices <- lapply(chrom_names, function(chrom) which(freq_table$chr == chrom))
names(chrom_indices) <- chrom_names
out_list[["Per-chromosome D"]] <- lapply(chrom_names, function(chrom){
per_chr_out <- list()
# get D stat
D_by_chrom <- D.stat(freq_table[chrom_indices[[chrom]], P1],
freq_table[chrom_indices[[chrom]], P2],
freq_table[chrom_indices[[chrom]], P3])
per_chr_out[["D statistic"]] <- D_by_chrom
# number of SNPs
per_chr_out[["Number of SNPs"]] <- length(chrom_indices[[chrom]])
# get block indices
block_indices_by_chrom <- get_block_indices(block_size=1e6,
positions=freq_table$pos[freq_table$chr == chrom])
# remove empty blocks
block_indices_by_chrom <- block_indices_by_chrom[lapply(block_indices_by_chrom, length) > 0 ]
# Get SD
D_sd_by_chrom <- get_jackknife_sd(block_indices=block_indices_by_chrom,
FUN=D.stat,
freq_table[chrom_indices[[chrom]], P1],
freq_table[chrom_indices[[chrom]], P2],
freq_table[chrom_indices[[chrom]], P3])
per_chr_out[["Standard error"]] <- D_sd_by_chrom
# Get Z score
D_err_by_chrom <- D_sd_by_chrom / sqrt(length(block_indices_by_chrom))
D_Z_by_chrom <- D_by_chrom / D_err_by_chrom
per_chr_out[["Z score"]] <- D_Z_by_chrom
# Return list
return(per_chr_out)
})
names(out_list[["Per-chromosome D"]]) <- chrom_names
# return output
return(out_list)
}
test_perchr <- test_function(data = freq_table,
P1 = "hni",
P2 = "hdrr")
# WORKS
test <- run_abbababa(data = freq_table,
P1 = "hni",
P2 = "hdrr")
# Save list
save(test, file = "~/Documents/Data/20200910_abbababa_test/p1-hni_p2_hdrr.rds")
# TEST
mela_p2s <- c("hdrr", "hni", "hsok", "javanicus")
melastigma <- lapply(c("hdrr", "hni", "hsok", "javanicus"), function(P2){
out <- run_abbababa(data = freq_table,
P1 = "melastigma",
P2 = P2)
return(out)
})
names(melastigma) <- mela_p2s
p1s <- c("hni", "hsok", "javanicus", "melastigma")
p2s <- c("hdrr", "hni", "hsok", "javanicus", "melastigma")
all <- lapply(p1s, function(p1){
p2_list <- lapply(p2s, function(p2){
if (p1 != p2){
out <- run_abbababa(data = freq_table,
P1 = p1,
P2 = p2)
return(out)
}
})
names(p2_list) <- p2s
return(p2_list)
})
names(all) <- p1s
# Take ages, but works
# Remove empty rows
final_list <- lapply(all, function(p1){
out <- p1[lapply(p1, length) > 0]
return(out)
})
# Reduce down to DF
final_df <- lapply(final_list, function(p1){
out <- lapply(p1, function(p2){
# Get D stats
g_wide_d <- p2[["Genome-wide D"]][["D statistic"]]
per_chr_d <- sapply(p2[["Per-chromosome D"]],
function(x) x[["D statistic"]])
# Get Z scores
g_wide_z <- p2[["Genome-wide D"]][["Z score"]]
per_chr_z <- sapply(p2[["Per-chromosome D"]],
function(x) x[["Z score"]])
# Get admixture
admix_f <- p2[["Admixture"]][["f statistic"]]
# Create data frame
df_out <- data.frame("chr" = c("genome_wide", names(per_chr_d)),
"d_stat" = c(g_wide_d, per_chr_d),
"z_score" = c(g_wide_z, per_chr_z),
"admix_f" = c(admix_f, rep(NA, length(per_chr_z))))
return(df_out)
})
# bind rows into single DF
out <- dplyr::bind_rows(out, .id = "p2")
return(out)
})
final_df <- dplyr::bind_rows(final_df, .id = "p1")
# Admixture
final_df %>% ggplot() +
geom_col(aes(p2, admix_f, fill = p1),
position = "dodge")
# Z score per chromosome
final_df %>% ggplot() +
geom_col(aes(p2, z_score, fill = p1),
position = "dodge") +
facet_wrap(~chr)
run_abbababa function to get f-statistic per chromosomerun_abbababa <- function(data, P1, P2){
# set popns and remove NAs
P3 <- "mikk"
P3a <- "mikk_a"
P3b <- "mikk_b"
pops <- c(P1, P2, P3, P3a, P3b)
# Select only those populations and remove NAs
freq_table <- data %>%
dplyr::select(chr, pos, ancestral, derived, all_of(pops)) %>%
tidyr::drop_na()
# Create output list
out_list <- list()
# Add populations
out_list[["Populations"]] <- list("P1" = P1,
"P2" = P2,
"P3" = "mikk")
# Get genome-wide D stat
out_list[["Genome-wide D"]] <- list()
D <- D.stat(freq_table[,P1], freq_table[,P2], freq_table[,P3])
out_list[["Genome-wide D"]][["D statistic"]] <- D
block_indices <- get_block_indices(block_size=1e6, # Block jackknife to obtain SD and Z-score
positions=freq_table$pos,
chromosomes=freq_table$chr)
block_indices <- block_indices[lapply(block_indices, length) > 0 ] # remove empty entries
n_blocks <- length(block_indices)
out_list[["Genome-wide D"]][["Number of blocks"]] <- n_blocks
D_sd <- get_jackknife_sd(block_indices=block_indices, # get D SD
FUN=D.stat,
freq_table[,P1], freq_table[,P2], freq_table[,P3])
out_list[["Genome-wide D"]][["Standard deviation"]] <- D_sd
D_err <- D_sd/sqrt(n_blocks)
D_Z <- D / D_err
out_list[["Genome-wide D"]][["Z score"]] <- D_Z
# Get admixture proportion
out_list[["Admixture"]] <- list()
f <- f.stat(freq_table[,P1], freq_table[,P2], freq_table[,P3a], freq_table[,P3b])
out_list[["Admixture"]][["f statistic"]] <- f
f_sd <- get_jackknife_sd(block_indices=block_indices,
FUN=f.stat,
freq_table[,P1], freq_table[,P2], freq_table[,P3a], freq_table[,P3b])
out_list[["Admixture"]][["Standard deviation"]] <- f_sd
f_err <- f_sd/sqrt(n_blocks)
f_CI_lower <- f - 1.96*f_err
f_CI_upper <- f + 1.96*f_err
out_list[["Admixture"]][["Confidence interval"]] <- list("lower" = f_CI_lower,
"upper" = f_CI_upper)
# Get per-chromosome states
chrom_names <- unique(freq_table$chr)
chrom_indices <- lapply(chrom_names, function(chrom) which(freq_table$chr == chrom))
names(chrom_indices) <- chrom_names
out_list[["Per-chromosome"]] <- lapply(chrom_names, function(chrom){
per_chr_out <- list()
# get D stat
D_by_chrom <- D.stat(freq_table[chrom_indices[[chrom]], P1],
freq_table[chrom_indices[[chrom]], P2],
freq_table[chrom_indices[[chrom]], P3])
per_chr_out[["D statistic"]] <- D_by_chrom
# number of SNPs
per_chr_out[["Number of SNPs"]] <- length(chrom_indices[[chrom]])
# get block indices
block_indices_by_chrom <- get_block_indices(block_size=1e6,
positions=freq_table$pos[freq_table$chr == chrom])
# remove empty blocks
block_indices_by_chrom <- block_indices_by_chrom[lapply(block_indices_by_chrom, length) > 0 ]
# Get SD
D_sd_by_chrom <- get_jackknife_sd(block_indices=block_indices_by_chrom,
FUN=D.stat,
freq_table[chrom_indices[[chrom]], P1],
freq_table[chrom_indices[[chrom]], P2],
freq_table[chrom_indices[[chrom]], P3])
per_chr_out[["Standard error"]] <- D_sd_by_chrom
# Get Z score
D_err_by_chrom <- D_sd_by_chrom / sqrt(length(block_indices_by_chrom))
D_Z_by_chrom <- D_by_chrom / D_err_by_chrom
per_chr_out[["Z score"]] <- D_Z_by_chrom
# Get admixture
per_chr_out[["Admixture"]] <- list()
f_by_chrom <- f.stat(freq_table[chrom_indices[[chrom]],P1],
freq_table[chrom_indices[[chrom]],P2],
freq_table[chrom_indices[[chrom]],P3a],
freq_table[chrom_indices[[chrom]],P3b])
per_chr_out[["Admixture"]][["f statistic"]] <- f_by_chrom
f_sd_by_chrom <- get_jackknife_sd(block_indices=block_indices_by_chrom,
FUN=f.stat,
freq_table[chrom_indices[[chrom]],P1],
freq_table[chrom_indices[[chrom]],P2],
freq_table[chrom_indices[[chrom]],P3a],
freq_table[chrom_indices[[chrom]],P3b])
per_chr_out[["Admixture"]][["Standard deviation"]] <- f_sd_by_chrom
f_err_by_chrom <- f_sd_by_chrom / sqrt(length(block_indices_by_chrom))
f_CI_lower_by_chrom <- f - 1.96*f_err_by_chrom
f_CI_upper_by_chrom <- f + 1.96*f_err_by_chrom
per_chr_out[["Admixture"]][["Confidence interval"]] <- list("lower" = f_CI_lower_by_chrom,
"upper" = f_CI_upper_by_chrom)
# Return list
return(per_chr_out)
})
names(out_list[["Per-chromosome"]]) <- chrom_names
# return output
return(out_list)
}
test_p1_hni_p2_hdrr <- run_abbababa(data = freq_table,
P1 = "hni",
P2 = "hdrr")
# Pull out per-chromosome
sapply(test_p1_hni_p2_hdrr$`Per-chromosome`, function(x) x$Admixture$`f statistic`)
p1s <- c("hdrr", "hni", "hsok", "javanicus", "melastigma")
p2s <- p1s
full_ab_list <- lapply(p1s, function(p1){
p2_list <- lapply(p2s, function(p2){
if (p1 != p2){
out <- run_abbababa(data = freq_table,
P1 = p1,
P2 = p2)
return(out)
}
})
names(p2_list) <- p2s
return(p2_list)
})
names(full_ab_list) <- p1s
saveRDS(full_ab_list, file = "~/Documents/Repositories/mikk_genome/data/20200917_abba_out.RData")
test <- readRDS("~/Documents/Repositories/mikk_genome/data/20200917_abba_out.RData")
mkdir abba_baba_mikk/20200925
for i in $(echo hdrr hni hsok javanicus melastigma ); do
for j in $(echo hdrr hni hsok javanicus melastigma ); do
# don't run if they are the same
if [ $i != $j ]; then
p1=$i;
p2=$j;
dat_file=$(echo abba_baba_mikk/freq_tables/all_with-split-mikk.txt);
dir_out=$(echo abba_baba_mikk/20200925 );
bsub \
-M 10000 \
-o log/20200925_abbababa_$p1\_$p2.out \
-e log/20200925_abbababa_$p1\_$p2.err \
"Rscript --vanilla mikk_genome/code/scripts/20200916_run_abbababa.R $p1 $p2 $dat_file $dir_out "
fi
done;
done
# get files
files <- list.files("abba_baba_mikk/20200925",
full.names = T)
# get population names
pop_names <- basename(files) %>%
stringr::str_split(., "\\.", simplify = T) %>%
subset(select = 1) %>%
stringr::str_split("_", simplify = T)
# load into list
final_list <- lapply(unique(pop_names[, 1]), function(P1){
p1_inds <- which(pop_names[, 1] %in% P1 )
P2 <- lapply(p1_inds, function(file_ind){
readRDS(files[file_ind])
})
names(P2) <- pop_names[p1_inds, 2]
return(P2)
})
names(final_list) <- unique(pop_names[, 1])
# save
saveRDS(final_list, "abba_baba_mikk/20200925_all/all.RDS")
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/abba_baba_mikk/20200925_all/all.RDS ~/Documents/Data/20200910_abbababa_test/20200925.RDS
# Load data
final_list <- readRDS("~/Documents/Data/20200910_abbababa_test/20200925.RDS")
# Reduce down to DF
final_df <- lapply(final_list, function(p1){
out <- lapply(p1, function(p2){
# Get D stats
g_wide_d <- p2[["Genome-wide D"]][["D statistic"]]
per_chr_d <- sapply(p2[["Per-chromosome"]],
function(x) x[["D statistic"]])
# Get Z scores
g_wide_z <- p2[["Genome-wide D"]][["Z score"]]
per_chr_z <- sapply(p2[["Per-chromosome"]],
function(x) x[["Z score"]])
# Get admixture
admix_f <- p2[["Admixture"]][["f statistic"]]
per_chr_f <- sapply(p2[["Per-chromosome"]],
function(x) x[["Admixture"]][["f statistic"]])
# Create data frame
df_out <- data.frame("chr" = c("all", names(per_chr_d)),
"d_stat" = c(g_wide_d, per_chr_d),
"z_score" = c(g_wide_z, per_chr_z),
"admix_f" = c(admix_f, per_chr_f))
return(df_out)
})
# bind rows into single DF
out <- dplyr::bind_rows(out, .id = "p2")
return(out)
})
final_df <- dplyr::bind_rows(final_df, .id = "p1")
# tidy up factors, etc for data
final_df$chr <- factor(final_df$chr, levels = c(seq(1, 24), "all"))
## capitalise species/line names
final_df <- final_df %>%
dplyr::mutate(across(c("p1", "p2"),
~dplyr::if_else(.x %in% c("hni", "hsok"),
str_to_upper(.x),
dplyr::if_else(.x == "hdrr",
"HdrR",
.x)))) %>%
dplyr::rename(P1 = p1,
P2 = p2)
final_df %>%
dplyr::filter(p1 %in% c("javanicus", "melastigma") & p2 %in% c("HdrR", "HNI", "HSOK")) %>%
tidyr::pivot_wider(id_cols = c("p2", "chr"),
names_from = p1,
values_from = admix_f) %>%
ggplot() +
geom_point(aes(melastigma,
javanicus,
colour = chr,
shape = p2))
Plot
final_df %>%
dplyr::filter(chr == "all") %>%
ggplot() +
geom_col(aes(p2, admix_f, fill = p1),
position = "dodge")
Looks like javanicus is the most unlike MIKK. Use them as the outgroup
final_df %>%
dplyr::filter(chr == "genome_wide",
p1 == "javanicus") %>%
ggplot() +
geom_col(aes(p2, admix_f, fill = p2)) +
guides(fill = F)
Facet by chromosome
final_df %>%
dplyr::filter(p1 == "javanicus") %>%
ggplot() +
geom_col(aes(p2, admix_f, fill = p2)) +
guides(fill = F) +
facet_wrap(~chr)
# CLEAN
mkdir emfs/cleaned_jap-med-only
for i in $(find emfs/segmented/*/*_1.data.txt ); do # takes only forward strand sequences
# make new directory
chr_file=$(echo $i | cut -f3 -d"/" );
new_path=$(echo emfs/cleaned_jap-med-only/$chr_file );
if [ ! -d "$new_path" ]; then
mkdir $new_path;
fi
# create name for new file
bname=$(echo $i | sed 's/.data.txt//g' );
bname_short=$(basename $bname );
bsub \
-M 10000 \
-o log/20200929_clean_jap_$chr_file\_$bname_short.out \
-e log/20200929_clean_jap_$chr_file\_$bname_short.err \
"Rscript --vanilla mikk_genome/code/scripts/20200921_add-hdrr-coords-to-emf-data_jap-med.R $bname $new_path";
done
# how many successfully completed?
grep "Execution halted" log/20200929*
# All except for 17_1
Consolidate
mkdir emfs/consolidated_jap-med-only
# run for each sub-chr (e.g. 10_1, 10_2)
for i in $(find emfs/cleaned_jap-med-only/* -type d ); do
# get chromosome
chr=$(echo $i | cut -f3 -d"/" );
# consolidate into one file
bsub \
-M 20000 \
-o log/20200928_consol_$chr.out \
-e log/20200928_consol_$chr.err \
"Rscript --vanilla mikk_genome/code/scripts/20200910_consolidate_chr_dat.R $i emfs/consolidated_jap-med-only/$chr.txt ";
done
# combine for each chr
mkdir emfs/final_jap-med-only
for i in $(find emfs/consolidated_jap-med-only/* | cut -f3 -d"/" | cut -f1 -d"_" | sort | uniq ); do
dir_in=$(echo emfs/consolidated_jap-med-only );
dir_out=$(echo emfs/final_jap-med-only );
bsub \
-M 20000 \
-o log/20200928_full_chr_$i.out \
-e log/20200928_full_chr_$i.err \
"Rscript --vanilla mikk_genome/code/scripts/20200928_consolidate_chr_files_jap-med.R $i $dir_in $dir_out ";
done
Add MIKK frequencies
mkdir abba_baba_mikk/freq_tables_jap-med-only
for i in $( seq 1 24 ); do
dat_file=$( echo emfs/final_jap-med-only/$i.txt );
af_file=$( echo maf/20200910_split-by-chr/$i.txt );
dir_out=$( echo abba_baba_mikk/freq_tables_jap-med-only );
bsub \
-M 30000 \
-o log/20200928_frq_tbl_$i.out \
-e log/20200928_frq_tbl_$i.err \
"Rscript --vanilla mikk_genome/code/scripts/20200929_combine_final_and_af_data_jap-med.R $i $dat_file $af_file $dir_out ";
done
Combine chr files into single file - run directly on cluster
files <- list.files("abba_baba_mikk/freq_tables_jap-med-only",
pattern = "[0-9]",
full.names = T)
# read in data
dat_list <- lapply(files, function(x){
# read in data
df <- readr::read_tsv(x,
col_types = "iiccdiii")
return(df)
})
# bind together and sort
final_df <- dplyr::bind_rows(dat_list) %>% # bind into single DF
dplyr::arrange(chr, pos) # sort by chromosome, then position
# write table
write.table(final_df, "abba_baba_mikk/freq_tables_jap-med-only/all.txt", quote = F, sep = "\t", row.names = F)
Do the same for split MIKK
mkdir abba_baba_mikk/freq_tables_mikk_1_jap-med-only
mkdir abba_baba_mikk/freq_tables_mikk_2_jap-med-only
for i in $( seq 1 2 ); do
for j in $( seq 1 24 ); do
dat_file=$( echo emfs/final_jap-med-only/$j.txt );
af_file=$( echo maf/20200915_split-by-chr_$i/$j.txt );
dir_out=$( echo abba_baba_mikk/freq_tables_mikk_$i\_jap-med-only );
bsub \
-M 30000 \
-o log/20200929_frq_tbl_split_$i\_$j.out \
-e log/20200929_frq_tbl_split_$i\_$j.err \
"Rscript --vanilla mikk_genome/code/scripts/20200929_combine_final_and_af_data_jap-med.R $j $dat_file $af_file $dir_out ";
done;
done
Combine chr files into single file directly on cluster
group_a <- list.files("abba_baba_mikk/freq_tables_mikk_1_jap-med-only",
pattern = "[0-9]",
full.names = T)
group_b <- list.files("abba_baba_mikk/freq_tables_mikk_2_jap-med-only",
pattern = "[0-9]",
full.names = T)
# read in data
dat_list_a <- lapply(group_a, function(x){
# read in data
df <- readr::read_tsv(x,
col_types = "iiccdiii") %>%
dplyr::rename(mikk_a = mikk)
return(df)
})
dat_list_b <- lapply(group_b, function(x){
# read in data
df <- readr::read_tsv(x,
col_types = "iiccdiii") %>%
dplyr::rename(mikk_b = mikk)
return(df)
})
# bind together and sort
final_df_a <- dplyr::bind_rows(dat_list_a) %>% # bind into single DF
dplyr::arrange(chr, pos) # sort by chromosome, then position
final_df_b <- dplyr::bind_rows(dat_list_b) %>% # bind into single DF
dplyr::arrange(chr, pos) # sort by chromosome, then position
# read in previous final data
final_df_prev <- readr::read_tsv("abba_baba_mikk/freq_tables_jap-med-only/all.txt",
col_types = "iiccdiii")
# join
final_df <- final_df_prev %>%
dplyr::full_join(dplyr::select(final_df_a, chr, pos, mikk_a),
by = c("chr", "pos")) %>%
dplyr::full_join(dplyr::select(final_df_b, chr, pos, mikk_b),
by = c("chr", "pos")) %>%
dplyr::select(chr, pos, ancestral, derived, mikk, mikk_a, mikk_b, everything())
# write table
write.table(final_df, "abba_baba_mikk/freq_tables_jap-med-only/all_with-split-mikk.txt", quote = F, sep = "\t", row.names = F)
Run ABBA BABA
mkdir abba_baba_mikk/20200929
for i in $(echo hdrr hni hsok ); do
for j in $(echo hdrr hni hsok ); do
# don't run if they are the same
if [ $i != $j ]; then
p1=$i;
p2=$j;
dat_file=$(echo abba_baba_mikk/freq_tables_jap-med-only/all_with-split-mikk.txt );
dir_out=$(echo abba_baba_mikk/20200929 );
bsub \
-M 10000 \
-o log/20200929_abbababa_$p1\_$p2.out \
-e log/20200929_abbababa_$p1\_$p2.err \
"Rscript --vanilla mikk_genome/code/scripts/20200916_run_abbababa.R $p1 $p2 $dat_file $dir_out "
fi
done;
done
mkdir abba_baba_mikk/20200929_all
library(tidyverse)
# get files
files <- list.files("abba_baba_mikk/20200929",
full.names = T)
# get population names
pop_names <- basename(files) %>%
stringr::str_split(., "\\.", simplify = T) %>%
subset(select = 1) %>%
stringr::str_split("_", simplify = T)
# load into list
p1_names <- unique(pop_names[, 1]) # get unique names in P1(pop_names[, 1])
# put lists into nested list by P1 then P2
final_list <- lapply(p1_names, function(P1){
# get indices of all files matching the target P1
p1_inds <- which(pop_names[, 1] %in% P1 )
P2 <- lapply(p1_inds, function(file_ind){
readRDS(files[file_ind])
})
names(P2) <- pop_names[p1_inds, 2]
return(P2)
})
names(final_list) <- p1_names
# save
saveRDS(final_list, "abba_baba_mikk/20200929_all/all.RDS")
Convert both datasets (all Oryzias, and jap-med-only) into DFs
mkdir abba_baba_mikk/20200929_final
for i in $(echo 20200925 20200929 ); do
if [ $i == "20200925" ]; then
out=$(echo oryzias_all );
else
out=$(echo oryzias_latipes );
fi
file_in=$(echo abba_baba_mikk/$i\_all/all.RDS );
file_out=$(echo abba_baba_mikk/20200929_final/20200929_abbababa_final_$out.txt );
bsub \
-o log/20200930_list_to_df_$out.out \
-e log/20200930_list_to_df_$out.err \
"Rscript --vanilla mikk_genome/code/scripts/20200929_convert_final_list_to_df.R $file_in $file_out "
done
# copy to repo
cp abba_baba_mikk/20200929_final/* mikk_genome/data/
# Cut and pasted tree from emfs/README.47
# If generating the tree directly, the tips remove the spaces between Genus and species, so manually edit using regex to find spaces and replace them with "_"
# (?<=[a-z])( )(?=[a-z])
phylo_tree <- ape::read.tree(file = "~/Documents/Repositories/mikk_genome/data/20200921_47-fish-epo_tree.txt")
# Colour all Oryzias
ids <- phylo_tree$tip.label[grep("Oryzias", phylo_tree$tip.label)]
# get indices of edges descending from MRCA (determined through trial and error)
oryzias_nodes <- seq(26, 33)
all_med_col <- ifelse(1:length(phylo_tree[["edge.length"]]) %in% oryzias_nodes, "#E84141", "black")
# set colours for tip labels
all_med_tip <- ifelse(phylo_tree$tip.label %in% ids, "#E84141", "black")
# plot
ape::plot.phylo(phylo_tree,
use.edge.length = T,
edge.color = all_med_col,
tip.color = all_med_tip,
font = 4)
# Save
png(file="~/Documents/Docs/medaka pics/20200921_introgression/tree_all_oryzias.png",
width=22,
height=25,
units = "cm",
res = 400)
ape::plot.phylo(phylo_tree,
use.edge.length = T,
edge.color = all_med_col,
tip.color = all_med_tip,
font = 4)
dev.off()
# Colour all Oryzias
ids <- phylo_tree$tip.label[grep("Oryzias_latipes", phylo_tree$tip.label)]
# get indices of edges descending from MRCA (determined through trial and error)
oryzias_nodes <- seq(30, 33)
all_med_col <- ifelse(1:length(phylo_tree[["edge.length"]]) %in% oryzias_nodes, "#E84141", "black")
# set colours for tip labels
all_med_tip <- ifelse(phylo_tree$tip.label %in% ids, "#E84141", "black")
# plot
ape::plot.phylo(phylo_tree,
use.edge.length = T,
edge.color = all_med_col,
tip.color = all_med_tip,
font = 4)
# Save
png(file="~/Documents/Docs/medaka pics/20200921_introgression/tree_all_oryzias_latipes.png",
width=22,
height=25,
units = "cm",
res = 400)
ape::plot.phylo(phylo_tree,
use.edge.length = T,
edge.color = all_med_col,
tip.color = all_med_tip,
font = 4)
dev.off()
20201002 do one with just Oryzias
# Read in
phylo_tree <- ape::read.tree(file = "~/Documents/Repositories/mikk_genome/data/20201003_47-fish-epo_tree_oryzias_only.txt")
# Set colours
phylo_cols <- c("#55b6b0", "#f33a56", "#f3b61f", "#f6673a", "#631e68")
# Plot
ape::plot.phylo(phylo_tree,
font = 4,
tip.color = phylo_cols,
direction = "downwards",
adj = 0.5,
srt = 90)
# Save
png(file="~/Documents/Docs/medaka pics/20200921_introgression/20201003_tree_oryzias.png",
width=2700,
height=1720,
units = "px",
res = 400)
ape::plot.phylo(phylo_tree,
font = 4,
tip.color = phylo_cols)
dev.off()
end 20201002
# Read in data
dsets <- c("all", "latipes")
final_df <- lapply(dsets, function(x){
file_name <- paste("~/Documents/Repositories/mikk_genome/data/20200929_abbababa_final_oryzias_",
x,
".txt",
sep = "")
file_name
out <- read.csv(file_name, header = T, sep = "\t")
return(out)
})
names(final_df) <- paste("oryzias", dsets, sep = "_")
# combine into single DF
final_df <- dplyr::bind_rows(final_df, .id = "dataset")
# Set colours
cols <- c("#F3B61F", "#631E68", "#F6673A", "#F33A56", "#55B6B0")
names(cols) <- c("HdrR", "HSOK", "HNI", "melastigma", "javanicus")
# Factorise chr, P1 and P2 to get them in the right order
chr_order <- c(seq(1,24), "all")
fish_order <- c("HdrR", "HSOK", "HNI", "melastigma", "javanicus")
final_df <- final_df %>%
dplyr::mutate(across(c("P1", "P2"),
~factor(.x, levels = fish_order))) %>%
dplyr::mutate(chr = factor(chr, levels = chr_order))
final_df %>%
dplyr::filter(dataset == "oryzias_all",
chr == "all",
P1 %in% c("melastigma", "javanicus"),
P2 %in% c("HdrR", "HNI", "HSOK")) %>%
ggplot(aes(P2, admix_f, fill = P1)) +
geom_col(position = "dodge") +
geom_errorbar(aes(ymin = f_ci_lower,
ymax = f_ci_upper),
position = position_dodge(0.9),
width = 0.25) +
scale_fill_manual(values = cols) +
labs(y = "Admixture F-statistic (genome-wide)") +
ylim(0, 1) +
theme_bw() +
theme(panel.grid = element_blank()) +
labs(title = expression(paste("Comparing ",
italic("Oryzias melastigma"),
" and ",
italic("Oryzias javanicus"),
" as the P1 allopatric population",
sep = "")))
ggsave(filename = paste("20200930_P1_mela_vs_java", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200921_introgression/",
width = 24.75,
height = 19.5,
units = "cm",
dpi = 500)
Do the same but using HNI and HSOK as the
final_df %>%
dplyr::filter(dataset == "oryzias_latipes",
chr == "all",
P1 %in% c("HNI", "HSOK"),
P2 %in% c("HdrR", "HNI", "HSOK")) %>%
dplyr::mutate(admix_f = replace(admix_f, P2 == "HNI", NA),
f_ci_lower = replace(f_ci_lower, P2 == "HNI", NA),
f_ci_upper = replace(f_ci_upper, P2 == "HNI", NA)) %>% # note I've made the values missing for the purposes of the plot
ggplot(aes(P2, admix_f, fill = P1)) +
geom_col(position = "dodge") +
geom_errorbar(aes(ymin = f_ci_lower,
ymax = f_ci_upper),
position = position_dodge(0.9),
width = 0.25) +
scale_fill_manual(values = cols) +
labs(y = "Admixture F-statistic (genome-wide)") +
ylim(0, 1) +
theme_bw() +
theme(panel.grid = element_blank()) +
ggtitle("Comparing HNI and HSOK as the P1 allopatric population, with closer ancestor")
ggsave(filename = paste("20200930_P1_hsok_v_hni", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200921_introgression/",
width = 24.75,
height = 19.5,
units = "cm",
dpi = 500)
cor_df <- final_df %>%
dplyr::filter(dataset == "oryzias_all") %>%
# filter for when P1 is another Oryzias, and P2
dplyr::filter(P1 %in% c("javanicus", "melastigma") & P2 %in% c("HdrR", "HNI", "HSOK")) %>%
# pivot to put the admixture_f stat for melastigma and javanicus in the same row
tidyr::pivot_wider(id_cols = c("P2", "chr"),
names_from = P1,
values_from = c(admix_f, f_ci_lower, f_ci_upper))
cor_df %>%
dplyr::filter(chr != "all") %>%
ggplot(aes(admix_f_melastigma,
admix_f_javanicus,
colour = P2,
label = chr)) +
geom_point() +
scale_colour_manual(values = cols) +
theme_bw() +
labs(x = "Oryzias melastigma",
y = "Oryzias javanicus") +
theme(panel.grid = element_blank(),
axis.title = element_text(face = "italic"))
ggsave(filename = paste("20200930_P2_correlation", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200921_introgression/",
width = 16.5,
height = 13,
units = "cm",
dpi = 500)
cor_df$chr <- as.character(cor_df$chr)
cor_df$chr <- ifelse(cor_df$chr == "all", "genome-wide", cor_df$chr)
chr_order_plot <- c(seq(1,24), "genome-wide")
cor_df$chr <- factor(cor_df$chr, levels = chr_order_plot)
test <- cor_df %>%
dplyr::rowwise() %>%
dplyr::mutate(mean_f = mean(c(admix_f_javanicus, admix_f_melastigma)),
mean_ci_upper = mean(c(f_ci_upper_javanicus, f_ci_upper_melastigma)),
mean_ci_lower = mean(c(f_ci_lower_javanicus, f_ci_lower_melastigma)))
test %>%
ggplot(aes(P2, mean_f, fill = P2)) +
geom_col() +
geom_errorbar(aes(ymin = mean_ci_lower,
ymax = mean_ci_upper),
position = position_dodge(0.9),
width = 0.25) +
guides(fill = F) +
facet_wrap(~chr) +
ylim(0,1) +
ylab(expression(paste("Mean ", italic("f"), " statistic"))) +
theme_bw() +
theme(panel.grid = element_blank()) +
scale_fill_manual(values = cols)
ggsave(filename = paste("20200930_f_stat_by_chr", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200921_introgression/",
width = 24.75,
height = 19.5,
units = "cm",
dpi = 500)
# Widen
wide_d <- final_df %>%
dplyr::filter(dataset == "oryzias_all") %>%
# filter for when P1 is another Oryzias, and P2
dplyr::filter(P1 %in% c("javanicus", "melastigma") & P2 %in% c("HdrR", "HNI", "HSOK")) %>%
# pivot to put the admixture_f stat for melastigma and javanicus in the same row
tidyr::pivot_wider(id_cols = c("P2", "chr"),
names_from = P1,
values_from = z_score)
# Fix factors
wide_d$chr <- as.character(wide_d$chr)
wide_d$chr <- ifelse(wide_d$chr == "all", "genome-wide", wide_d$chr)
chr_order_plot <- c(seq(1,24), "genome-wide")
wide_d$chr <- factor(wide_d$chr, levels = chr_order_plot)
wide_d %>%
dplyr::rowwise() %>%
dplyr::mutate(mean_z = mean(c(javanicus, melastigma)),
mean_ci_upper = mean(c(f_ci_upper_javanicus, f_ci_upper_melastigma)),
mean_ci_lower = mean(c(f_ci_lower_javanicus, f_ci_lower_melastigma))) %>%
ggplot() +
geom_col(aes(P2, mean_z, fill = P2)) +
geom_errorbar(aes(ymin = mean_ci_lower,
ymax = mean_ci_upper),
position = position_dodge(0.9),
width = 0.25) +
guides(fill = F) +
facet_wrap(~chr) +
ylab(expression(paste("Mean z-score for ", italic("D"), " statistic"))) +
theme_bw() +
theme(panel.grid = element_blank()) +
scale_fill_manual(values = cols)
ggsave(filename = paste("20200930_z_score_for_d_by_chr", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200921_introgression/",
width = 24.75,
height = 19.5,
units = "cm",
dpi = 500)
# Get mean f statistics for 3 strains
test[test$chr == "all", colnames(test) %in% c("P2", "mean_f")]
# Which chromosomes are higher in HNI than HSOK?
higher_hni <- test %>% dplyr::select(P2, chr, mean_f) %>%
tidyr::pivot_wider(names_from = P2, values_from = mean_f)
higher_hni[higher_hni$HNI > higher_hni$HSOK, ]
Meeting with Tom:
https://github.com/simonhmartin/tutorials/blob/master/ABBA_BABA_windows/README.md
# abbababa_tutorial folder already exists, so adapt the code from the README slightly
mkdir data/sliding_windows
cd data/sliding_windows
wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_windows/data/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz
wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_windows/data/hel92.pop.txt
wget https://github.com/simonhmartin/tutorials/raw/master/ABBA_BABA_windows/data/chr18.LDhelmet_MLrho.w100.tsv
cd ../..
# Run the analysis python script
python genomics_general-master/ABBABABAwindows.py \
-g data/sliding_windows/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz -f phased \
-o data/sliding_windows/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ros_chi_num.w25m250.csv.gz \
-P1 mel_mel -P2 mel_ros -P3 cyd_chi -O num \
--popsFile data/sliding_windows/hel92.pop.txt -w 25000 -m 250 --T 2
python genomics_general-master/ABBABABAwindows.py \
-g data/sliding_windows/hel92.DP8HET75MP9BIminVar2.chr18.geno.gz -f phased \
-o data/sliding_windows/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ama_txn_num.w25m250.csv.gz \
-P1 mel_mel -P2 mel_ama -P3 tim_txn -O num \
--popsFile data/sliding_windows/hel92.pop.txt -w 25000 -m 250 --T 2
# Send to local
bash
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/abbababba_tutorial/data/sliding_windows/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel* ~/Documents/Data/20200910_abbababa_test
AB_files <- c("~/Documents/Data/20200910_abbababa_test/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ros_chi_num.w25m250.csv.gz",
"~/Documents/Data/20200910_abbababa_test/hel92.DP8HET75MP9BIminVar2.chr18.ABBABABA_mel_ama_txn_num.w25m250.csv.gz")
AB_tables = lapply(AB_files, read.csv)
head(AB_tables[[1]])
# Convert all fd values to 0 at sites where D is negative
for (x in 1:length(AB_tables)){
AB_tables[[x]]$fd = ifelse(AB_tables[[x]]$D < 0, 0, AB_tables[[x]]$fd)
}
# Plot fd across the chromosome
par(mfrow=c(length(AB_tables), 1), mar = c(4,4,1,1))
for (x in 1:length(AB_tables)){
plot(AB_tables[[x]]$mid, AB_tables[[x]]$fd,
type = "l", xlim=c(0,17e6),ylim=c(0,1),ylab="Admixture Proportion",xlab="Position")
rect(1000000,0,1250000,1, col = rgb(0,0,0,0.2), border=NA)
}
HdrR index: https://asia.ensembl.org/Oryzias_latipes/Info/Index Coding regions fasta page: ftp://ftp.ensembl.org/pub/release-101/fasta/oryzias_latipes/cds/
From the README:
“These files hold the coding sequences corresponding to Ensembl gene predictions. CDS does not contain UTR or intronic sequence.”
mkcd fastas_exons
wget ftp://ftp.ensembl.org/pub/release-101/fasta/oryzias_latipes/cds/Oryzias_latipes.ASM223467v1.cds.all.fa.gz
gunzip -k Oryzias_latipes.ASM223467v1.cds.all.fa.gz
cd ..
# How many exons?
grep ">" fastas_exons/Oryzias_latipes.ASM223467v1.cds.all.fa | wc -l
# 37468
# Confirm which info is the chromosome
grep ">" fastas_exons/Oryzias_latipes.ASM223467v1.cds.all.fa | cut -f3 -d" " | cut -f3 -d":" | sort | uniq
# Yes
# Columns 3, 4, 5, and 6 are chr, start, end, strand
grep ">" fastas_exons/Oryzias_latipes.ASM223467v1.cds.all.fa | cut -f3 -d" " | awk -F ":" -v OFS="\t" '{print $3, $4, $5, $6}'
# Works.
# Make file
## Create header
printf "chr\tstart\tend\tstrand\n" > stats/202010121_exon_pos.txt
# Send stats to file
grep ">" fastas_exons/Oryzias_latipes.ASM223467v1.cds.all.fa | cut -f3 -d" " | awk -F ":" -v OFS="\t" '{print $3, $4, $5, $6}' >> stats/202010121_exon_pos.txt
# Send to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/stats/202010121_exon_pos.txt ~/Documents/Data/20201012_mikk_genome
# Read in total chromosome lengths
# Read in exon data
exons <- readr::read_tsv("~/Documents/Data/20201012_mikk_genome/202010121_exon_pos.txt")
biomaRt# Get lenghth of chromosomes
chr_counts <- readr::read_tsv("~/Documents/Repositories/mikk_genome/data/Oryzias_latipes.ASM223467v1.dna.toplevel.fa_chr_counts.txt",
col_names = c("chr", "length"))
── Column specification ────────────────────────────────────────────────────────────────────────
cols(
chr = col_double(),
length = col_double()
)
library("biomaRt")
listMarts()
# Select database and list datasets within
ensembl_mart <- useMart("ENSEMBL_MART_ENSEMBL")
listDatasets(ensembl_mart)
# Select dataset
ensembl_olat <- useDataset("olatipes_gene_ensembl", mart = ensembl_mart)
olat_mart = useEnsembl(biomart = "ensembl", dataset = "olatipes_gene_ensembl")
# Get attributes of interest (exon ID, chr, start, end)
exons <- getBM(attributes = c("chromosome_name", "ensembl_gene_id", "ensembl_transcript_id", "transcript_start", "transcript_end", "transcript_length", "ensembl_exon_id", "rank", "strand", "exon_chrom_start", "exon_chrom_end", "cds_start", "cds_end"),
mart = olat_mart)
# Factorise chr so it's in the right order
chrs <- unique(exons$chromosome_name)
auto_range <- range(as.integer(chrs), na.rm = T)
NAs introduced by coercion
non_auto <- chrs[is.na(as.integer(chrs))]
NAs introduced by coercion
chr_order <- c(seq(auto_range[1], auto_range[2]), non_auto)
exons$chromosome_name <- factor(exons$chromosome_name, levels = chr_order)
# Convert into list
exons_lst <- split(exons, f = exons$chromosome_name)
# Get mean length of exons per chromosome
exons_lst <- lapply(exons_lst, function(chr){
chr <- chr %>%
dplyr::mutate(exon_length = (exon_chrom_end - exon_chrom_start) + 1,
transcript_total_length = (transcript_end - transcript_start) + 1)
return(chr)
})
# Do histograms for exon length
lapply(exons_lst, function(chr){
hist(chr$exon_length)
})
$`1`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000
[16] 7500 8000 8500 9000 9500 10000 10500
$counts
[1] 18088 793 304 177 123 118 98 79 59 49 15 7 5 3 2
[16] 0 0 1 0 0 2
$density
[1] 1.815791e-03 7.960648e-05 3.051749e-05 1.776841e-05 1.234754e-05 1.184561e-05 9.837876e-06
[8] 7.930533e-06 5.922803e-06 4.918938e-06 1.505797e-06 7.027054e-07 5.019324e-07 3.011595e-07
[15] 2.007730e-07 0.000000e+00 0.000000e+00 1.003865e-07 0.000000e+00 0.000000e+00 2.007730e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250
[16] 7750 8250 8750 9250 9750 10250
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`2`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000 8500
[19] 9000
$counts
[1] 9376 453 169 96 59 37 33 40 19 10 6 1 0 0 0 1 0 1
$density
[1] 1.820406e-03 8.795263e-05 3.281235e-05 1.863897e-05 1.145520e-05 7.183769e-06 6.407145e-06
[8] 7.766236e-06 3.688962e-06 1.941559e-06 1.164935e-06 1.941559e-07 0.000000e+00 0.000000e+00
[15] 0.000000e+00 1.941559e-07 0.000000e+00 1.941559e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250 7750 8250 8750
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`3`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000
[16] 7500 8000 8500 9000 9500 10000 10500 11000
$counts
[1] 18015 692 276 183 100 115 78 72 50 36 25 6 2 3 0
[16] 0 0 0 0 0 1 2
$density
[1] 1.833028e-03 7.041107e-05 2.808303e-05 1.862027e-05 1.017501e-05 1.170126e-05 7.936508e-06
[8] 7.326007e-06 5.087505e-06 3.663004e-06 2.543753e-06 6.105006e-07 2.035002e-07 3.052503e-07
[15] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.017501e-07
[22] 2.035002e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250
[16] 7750 8250 8750 9250 9750 10250 10750
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`4`
$breaks
[1] 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000
[16] 15000 16000
$counts
[1] 18531 357 215 142 115 21 1 1 1 0 0 0 0 0 0
[16] 2
$density
[1] 9.558960e-04 1.841535e-05 1.109048e-05 7.324874e-06 5.932116e-06 1.083256e-06 5.158362e-08
[8] 5.158362e-08 5.158362e-08 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
[15] 0.000000e+00 1.031672e-07
$mids
[1] 500 1500 2500 3500 4500 5500 6500 7500 8500 9500 10500 11500 12500 13500 14500
[16] 15500
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`5`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500
$counts
[1] 17969 682 255 141 138 111 92 95 64 47 13 7 5 0 1
$density
[1] 1.831702e-03 6.952090e-05 2.599388e-05 1.437309e-05 1.406728e-05 1.131498e-05 9.378186e-06
[8] 9.683996e-06 6.523955e-06 4.791030e-06 1.325178e-06 7.135576e-07 5.096840e-07 0.000000e+00
[15] 1.019368e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`6`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000
[16] 7500 8000 8500 9000 9500 10000 10500 11000 11500
$counts
[1] 18129 708 259 135 99 115 92 77 59 44 21 4 2 5 5
[16] 0 1 0 0 0 0 0 2
$density
[1] 1.835198e-03 7.167080e-05 2.621856e-05 1.366604e-05 1.002176e-05 1.164144e-05 9.313155e-06
[8] 7.794706e-06 5.972567e-06 4.454118e-06 2.125829e-06 4.049198e-07 2.024599e-07 5.061497e-07
[15] 5.061497e-07 0.000000e+00 1.012299e-07 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
[22] 0.000000e+00 2.024599e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250
[16] 7750 8250 8750 9250 9750 10250 10750 11250
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`7`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000 8500
$counts
[1] 20593 873 314 152 146 109 95 104 50 56 20 7 5 1 0
[16] 2 1
$density
[1] 1.828214e-03 7.750355e-05 2.787642e-05 1.349432e-05 1.296165e-05 9.676847e-06 8.433949e-06
[8] 9.232955e-06 4.438920e-06 4.971591e-06 1.775568e-06 6.214489e-07 4.438920e-07 8.877841e-08
[15] 0.000000e+00 1.775568e-07 8.877841e-08
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250 7750 8250
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`8`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000 8500
[19] 9000
$counts
[1] 17551 788 276 166 128 104 66 63 69 53 8 3 5 4 0
[16] 4 0 5
$density
[1] 1.819416e-03 8.168766e-05 2.861141e-05 1.720831e-05 1.326906e-05 1.078111e-05 6.841860e-06
[8] 6.530866e-06 7.152853e-06 5.494221e-06 8.293163e-07 3.109936e-07 5.183227e-07 4.146582e-07
[15] 0.000000e+00 4.146582e-07 0.000000e+00 5.183227e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250 7750 8250 8750
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`9`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000
$counts
[1] 18990 688 257 135 128 111 62 88 65 42 11 5 5 4
$density
[1] 1.844495e-03 6.682531e-05 2.496236e-05 1.311252e-05 1.243262e-05 1.078141e-05 6.022048e-06
[8] 8.547424e-06 6.313438e-06 4.079452e-06 1.068428e-06 4.856491e-07 4.856491e-07 3.885193e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`10`
$breaks
[1] 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000
[16] 15000 16000 17000 18000
$counts
[1] 12987 349 209 115 83 22 6 0 5 0 0 0 0 0 0
[16] 0 0 2
$density
[1] 9.425896e-04 2.533024e-05 1.516911e-05 8.346640e-06 6.024096e-06 1.596748e-06 4.354768e-07
[8] 0.000000e+00 3.628974e-07 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
[15] 0.000000e+00 0.000000e+00 0.000000e+00 1.451589e-07
$mids
[1] 500 1500 2500 3500 4500 5500 6500 7500 8500 9500 10500 11500 12500 13500 14500
[16] 15500 16500 17500
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`11`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000 8500
$counts
[1] 12869 582 182 130 99 87 68 49 41 33 22 7 2 0 0
[16] 0 7
$density
[1] 1.815348e-03 8.209903e-05 2.567358e-05 1.833827e-05 1.396530e-05 1.227253e-05 9.592326e-06
[8] 6.912117e-06 5.783608e-06 4.655099e-06 3.103400e-06 9.874453e-07 2.821272e-07 0.000000e+00
[15] 0.000000e+00 0.000000e+00 9.874453e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250 7750 8250
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`12`
$breaks
[1] 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000
[16] 15000 16000 17000
$counts
[1] 15293 341 184 138 85 24 0 0 0 0 0 0 1 0 0
[16] 0 1
$density
[1] 9.518267e-04 2.122363e-05 1.145204e-05 8.589033e-06 5.290347e-06 1.493745e-06 0.000000e+00
[8] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 6.223937e-08 0.000000e+00
[15] 0.000000e+00 0.000000e+00 6.223937e-08
$mids
[1] 500 1500 2500 3500 4500 5500 6500 7500 8500 9500 10500 11500 12500 13500 14500
[16] 15500 16500
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`13`
$breaks
[1] 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000
$counts
[1] 17240 371 212 138 111 37 2 1 0 0 0 3
$density
[1] 9.516975e-04 2.048026e-05 1.170301e-05 7.617996e-06 6.127519e-06 2.042506e-06 1.104057e-07
[8] 5.520287e-08 0.000000e+00 0.000000e+00 0.000000e+00 1.656086e-07
$mids
[1] 500 1500 2500 3500 4500 5500 6500 7500 8500 9500 10500 11500
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`14`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000
$counts
[1] 15777 677 272 153 103 108 73 83 50 38 32 2
$density
[1] 1.816789e-03 7.795947e-05 3.132197e-05 1.761861e-05 1.186089e-05 1.243667e-05 8.406264e-06
[8] 9.557807e-06 5.757715e-06 4.375864e-06 3.684938e-06 2.303086e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`15`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000 8500
[19] 9000 9500
$counts
[1] 14726 603 209 148 95 110 67 39 37 32 14 4 1 1 0
[16] 0 0 0 1
$density
[1] 1.830795e-03 7.496736e-05 2.598371e-05 1.839995e-05 1.181078e-05 1.367564e-05 8.329707e-06
[8] 4.848636e-06 4.599988e-06 3.978368e-06 1.740536e-06 4.972960e-07 1.243240e-07 1.243240e-07
[15] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.243240e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250 7750 8250 8750
[19] 9250
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`16`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500
$counts
[1] 17298 782 303 184 112 125 103 83 56 55 19 4 11 1 3
$density
[1] 1.807618e-03 8.171796e-05 3.166310e-05 1.922775e-05 1.170385e-05 1.306233e-05 1.076336e-05
[8] 8.673389e-06 5.851925e-06 5.747427e-06 1.985475e-06 4.179947e-07 1.149485e-06 1.044987e-07
[15] 3.134960e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`17`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000
[16] 7500 8000 8500 9000 9500 10000 10500 11000 11500
$counts
[1] 16699 726 233 163 111 95 64 48 40 46 15 7 1 0 2
[16] 0 0 0 0 0 0 0 1
$density
[1] 1.829927e-03 7.955728e-05 2.553285e-05 1.786203e-05 1.216372e-05 1.041039e-05 7.013314e-06
[8] 5.259986e-06 4.383321e-06 5.040820e-06 1.643746e-06 7.670813e-07 1.095830e-07 0.000000e+00
[15] 2.191661e-07 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
[22] 0.000000e+00 1.095830e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250
[16] 7750 8250 8750 9250 9750 10250 10750 11250
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`18`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000 8500
$counts
[1] 11747 606 194 106 79 69 40 49 42 30 9 2 0 5 0
[16] 0 2
$density
[1] 1.810015e-03 9.337442e-05 2.989214e-05 1.633282e-05 1.217257e-05 1.063174e-05 6.163328e-06
[8] 7.550077e-06 6.471495e-06 4.622496e-06 1.386749e-06 3.081664e-07 0.000000e+00 7.704160e-07
[15] 0.000000e+00 0.000000e+00 3.081664e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250 7750 8250
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`19`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500
$counts
[1] 15532 627 225 141 132 97 70 67 57 26 14 6 3 3 1
$density
[1] 1.827187e-03 7.376037e-05 2.646903e-05 1.658726e-05 1.552850e-05 1.141109e-05 8.234810e-06
[8] 7.881889e-06 6.705488e-06 3.058644e-06 1.646962e-06 7.058408e-07 3.529204e-07 3.529204e-07
[15] 1.176401e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`20`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000
[16] 7500 8000 8500 9000 9500 10000 10500
$counts
[1] 11455 412 162 111 80 65 52 33 47 40 10 4 3 2 0
[16] 0 1 0 0 0 1
$density
[1] 1.836031e-03 6.603622e-05 2.596570e-05 1.779131e-05 1.282257e-05 1.041834e-05 8.334669e-06
[8] 5.289309e-06 7.533259e-06 6.411284e-06 1.602821e-06 6.411284e-07 4.808463e-07 3.205642e-07
[15] 0.000000e+00 0.000000e+00 1.602821e-07 0.000000e+00 0.000000e+00 0.000000e+00 1.602821e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250
[16] 7750 8250 8750 9250 9750 10250
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`21`
$breaks
[1] 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000
$counts
[1] 15639 336 178 119 91 13 2 0 0 1 2
$density
[1] 9.547036e-04 2.051157e-05 1.086625e-05 7.264514e-06 5.555216e-06 7.936023e-07 1.220927e-07
[8] 0.000000e+00 0.000000e+00 6.104633e-08 1.220927e-07
$mids
[1] 500 1500 2500 3500 4500 5500 6500 7500 8500 9500 10500
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`22`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500
$counts
[1] 14567 603 210 124 105 102 81 54 53 41 19 1 4 1 1
$density
[1] 1.824753e-03 7.553551e-05 2.630590e-05 1.553301e-05 1.315295e-05 1.277715e-05 1.014656e-05
[8] 6.764374e-06 6.639108e-06 5.135914e-06 2.380058e-06 1.252662e-07 5.010648e-07 1.252662e-07
[15] 1.252662e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`23`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000 8500
[19] 9000
$counts
[1] 11353 401 146 95 58 66 55 39 22 57 10 1 0 1 0
[16] 0 0 2
$density
[1] 1.845116e-03 6.517146e-05 2.372826e-05 1.543962e-05 9.426296e-06 1.072647e-05 8.938729e-06
[8] 6.338372e-06 3.575492e-06 9.263774e-06 1.625223e-06 1.625223e-07 0.000000e+00 1.625223e-07
[15] 0.000000e+00 0.000000e+00 0.000000e+00 3.250447e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750 7250 7750 8250 8750
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$`24`
$breaks
[1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000
$counts
[1] 12185 531 215 113 78 86 83 50 56 27 16 6 2 1
$density
[1] 1.812031e-03 7.896498e-05 3.197264e-05 1.680422e-05 1.159938e-05 1.278905e-05 1.234293e-05
[8] 7.435497e-06 8.327757e-06 4.015168e-06 2.379359e-06 8.922596e-07 2.974199e-07 1.487099e-07
$mids
[1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750 6250 6750
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
$MT
$breaks
[1] 0 200 400 600 800 1000 1200 1400 1600 1800 2000
$counts
[1] 23 2 1 3 2 2 1 1 1 1
$density
[1] 0.0031081081 0.0002702703 0.0001351351 0.0004054054 0.0002702703 0.0002702703 0.0001351351
[8] 0.0001351351 0.0001351351 0.0001351351
$mids
[1] 100 300 500 700 900 1100 1300 1500 1700 1900
$xname
[1] "chr$exon_length"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
# get total length of chr covered by exons
exon_lengths <- lapply(exons_lst, function(chr){
# create list of start pos to end pos sequences for each exon
out_list <- apply(chr, 1, function(exon) {
seq(exon[["exon_chrom_start"]], exon[["exon_chrom_end"]])
})
# combine list of vectors into single vector and get only unique numbers
out_vec <- unique(unlist(out_list))
# get length of out_vec and put it into data frame
out_final <- data.frame("exon_cov" = length(out_vec))
return(out_final)
})
# combine into single DF
exons_len_df <- dplyr::bind_rows(exon_lengths, .id = "chr") %>%
dplyr::filter(chr != "MT") %>%
dplyr::mutate(chr = as.integer(chr))
# join with chr_counts and get proportion of chr covered by exons
chr_stats <- dplyr::left_join(chr_counts, exons_len_df, by = "chr") %>%
dplyr::mutate(prop_cov_exon = exon_cov / length)
# convert chr to factor for plotting
chr_stats$chr <- factor(chr_stats$chr)
chr_stats %>%
ggplot() +
geom_col(aes(chr, prop_cov_exon, fill = chr)) +
theme_bw() +
theme(panel.grid = element_blank()) +
guides(fill = F) +
xlab("Chromosome") +
ylab("Proportion of chromosome covered by exons")
ggsave(filename = paste("20201012_exon_props", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200602_mikk_genome/",
width = 24.125,
height = 12.5,
units = "cm",
dpi = 500)
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200727_mean_r2_10kb-lim_mikk/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200727_mean_r2_10kb-lim_mikk")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb_mikk <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb_mikk$chr <- factor(r2_df_1kb_mikk$chr, levels = seq(1, 24))
# get kb measure
r2_df_1kb_mikk$bin_bdr_kb <- r2_df_1kb_mikk$bin_bdr / 1000
r2_df_1kb_mikk %>% ggplot() +
geom_line(aes(bin_bdr_kb, mean, colour = chr)) +
theme_bw() +
xlab("Distance between SNPs (kb)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~chr, nrow = 4, ncol = 6) +
theme(panel.grid = element_blank()) +
labs(colour = "Chromosome") +
scale_y_continuous(breaks = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
limits = c(0.05, 0.6))
ggsave(filename = paste("20201012_ld_decay_facets", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200602_mikk_genome/",
width = 30,
height = 19.5,
units = "cm",
dpi = 500)
# Get mean LD decay for distance window of 0-100 bp
ld_100bp <- r2_df_1kb_mikk %>%
dplyr::filter(bin_bdr == 0) %>%
dplyr::select(chr, first_bin_mean = mean)
# Get mean LD across each chromosome
ld_mean <- r2_df_1kb_mikk %>%
dplyr::group_by(chr) %>%
dplyr::summarise(total_mean = mean(mean))
final_df <- chr_stats %>%
dplyr::left_join(ld_100bp, by = "chr") %>%
dplyr::left_join(ld_mean, by = "chr")
ggplot(final_df, aes(prop_cov_exon, total_mean, colour = chr, label = chr)) +
geom_point() +
geom_text(hjust = -0.5) +
theme_bw() +
theme(panel.grid = element_blank()) +
guides(colour = F) +
ylab("Mean LD decay across all distance windows") +
xlab("Proportion of chromosome covered by exons")
ggsave(filename = paste("20201012_exons_v_meanld", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200602_mikk_genome/",
width = 24.75,
height = 19.5,
units = "cm",
dpi = 500)
ggplot(final_df, aes(prop_cov_exon, first_bin_mean, colour = chr, label = chr)) +
geom_point() +
geom_text(hjust = -0.5) +
theme_bw() +
theme(panel.grid = element_blank()) +
guides(colour = F) +
ylab("Mean r2 for SNPs 0 to 100 bp apart") +
xlab("Proportion of chromosome covered by exons")
ggsave(filename = paste("20201012_exons_v_firstwindowld", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200602_mikk_genome/",
width = 24.75,
height = 19.5,
units = "cm",
dpi = 500)
Could it be due to differences in the length of introns?
# Split exon_lst by transcript
ex_to_trans <- lapply(exons_lst, function(chr){
# split by transcript
ts_lst <- split(chr, f = chr$ensembl_transcript_id)
# check whether all transcript lengths are the same
#all(unlist(lapply(ts_lst, function(x){
# length(unique(x$transcript_length)) == 1
#})))
# Create DF with exon and total transcript length
ts_df <- lapply(ts_lst, function(x){
out_df <- data.frame("transcript_length" = unique(x$transcript_length),
"total_exon_length" = sum(x$exon_length),
"transcript_total_length" = unique(x$transcript_total_length))
# get proportion of transcript length to transcript total length
out_df <- out_df %>%
dplyr::mutate(ratio_exon_to_transcript = transcript_length / transcript_total_length )
return(out_df)
})
# bind DFs
ts_df <- dplyr::bind_rows(ts_df, .id = "ensembl_transcript_id")
# take transcript length
return(ts_df)
})
# Bind into single df
ex_to_trans_df <- dplyr::bind_rows(ex_to_trans, .id = "chr")
# Factor
ex_to_trans_df$chr <- factor(ex_to_trans_df$chr, levels = seq(1, 24))
Plot
ggplot(ex_to_trans_df) +
geom_boxplot(aes(chr, ratio_exon_to_transcript, fill = chr)) +
theme_bw() +
theme(panel.grid = element_blank()) +
guides(fill = F) +
xlab("Chromosome") +
ylab("Total length of exon sequences\n / total transcript length (end_pos - start_pos)")
ggsave(filename = paste("20201012_exon_to_transcript", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200602_mikk_genome/",
width = 24.125,
height = 12.5,
units = "cm",
dpi = 500)
ex_to_trans_med <- ex_to_trans_df %>%
dplyr::group_by(chr) %>%
dplyr::summarise(med_ex_to_trans = median(ratio_exon_to_transcript))
# sanity check
ggplot(ex_to_trans_med) +
geom_col(aes(chr, med_ex_to_trans, fill = chr))
# bind to chr_stats
final_df <- dplyr::left_join(final_df, med_ex_to_trans, by = "chr")
Plot
ggplot(final_df, aes(med_ex_to_trans, first_bin_mean, colour = chr, label = chr)) +
geom_point() +
geom_text(hjust = -0.5) +
theme_bw() +
theme(panel.grid = element_blank()) +
guides(colour = F) +
xlab("Median ratio of exon to total transcript") +
ylab("Mean r2 for SNPs 0 to 100 bp apart")
Get correlations
# Mean r^2 across all windows vs proportion of exon coverage
cor.test(final_df$total_mean, final_df$prop_cov_exon, method = "spearman")
# Mean r^2 between SNPs 1 to 100bp apart vs proportion of exon coverage
cor.test(final_df$total_mean, final_df$prop_cov_exon, method = "spearman")
# Mean r^2 across all windows vs median exon-to-transcript
cor.test(final_df$total_mean, final_df$med_ex_to_trans, method = "spearman")
# How many comparisons?
wc -l ld/20200727_mikk_maf-0.10_window-50kb_no-missing/*.ld
35849304 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/10.ld 33628705 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/11.ld 20296544 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/12.ld 31234421 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/13.ld 24737777 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/14.ld 24107349 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/15.ld 33794836 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/16.ld 32416583 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/17.ld 88536268 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/18.ld 31803630 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/19.ld 54768157 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/1.ld 25460469 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/20.ld 29680373 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/21.ld 17837411 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/22.ld 20639723 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/23.ld 26408595 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/24.ld 83028231 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/2.ld 35789791 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/3.ld 26365446 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/4.ld 22387640 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/5.ld 22042719 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/6.ld 34370140 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/7.ld 34606016 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/8.ld 34432214 ld/20200727_mikk_maf-0.10_window-50kb_no-missing/9.ld 824222342 total
# How many SNPs?
## Total
wc -l plink/20200727_mikk_no-missing_maf-0.05/20200727.bim
# 4357116
## Per chr
cut -f1 plink/20200727_mikk_no-missing_maf-0.05/20200727.bim | sort | uniq -c
252083 1 193656 10 172997 11 151599 12 195023 13 163803 14 164476 15 197105 16 195097 17 263572 18 153555 19 245984 2 146468 20 174783 21 138303 22 132032 23 137189 24 207470 3 174023 4 174974 5 164055 6 196475 7 169951 8 192404 9
# OLD
# Note this only takes the sequences on the forward strand - still gets 4552 / 9338 segments
for i in $(find emfs/segmented/*/*_1.data.txt ); do
# make new directory
chr_file=$(echo $i | cut -f3 -d"/" );
new_path=$(echo emfs/cleaned/$chr_file );
if [ ! -d "$new_path" ]; then
mkdir $new_path;
fi
# create name for new file
bname=$(echo $i | sed 's/.data.txt//g' );
bname_short=$(basename $bname );
bsub \
-M 10000 \
-o log/20200925_$chr_file\_$bname_short.out \
-e log/20200925_$chr_file\_$bname_short.err \
"Rscript --vanilla mikk_genome/code/scripts/20200908_add-hdrr-coords-to-emf-data.R $bname $new_path";
done
# First flip files
for i in $( find emfs/segmented/*/*_-1.data.txt ) ; do
dir_out=$( dirname $i );
bname=$( basename $i | cut -f1,2,3 -d"_" );
file_out=$dir_out/$bname\_1_rc.data.txt;
bsub \
-M 10000 \
-o log/20201015_reverse_$bname.out \
-e log/20201015_reverse_$bname.err \
"tac $i > $file_out";
done
# Then clean
for i in $( find emfs/segmented/*/*rc.data.txt ); do
# make new directory
chr_file=$( echo $i | cut -f3 -d"/" );
new_path=$( echo emfs/cleaned/$chr_file );
if [ ! -d "$new_path" ]; then
mkdir $new_path;
fi
# create name for new file
bname=$( echo $i | sed 's/_1_rc.data.txt//g' );
bname_short=$( basename $bname );
script=mikk_genome/code/scripts/20201015_add-hdrr-coords-to-emf-data_rc.R ;
bsub \
-M 10000 \
-o log/20201015_$chr_file\_$bname_short.out \
-e log/20201015_$chr_file\_$bname_short.err \
"Rscript --vanilla $script $bname $new_path";
done
# Other than 17_1, only failed on this: 24_2/24_16480574_16481928
# Consolidate
mkdir emfs/consolidated_with_rc
# run for each sub-chr (e.g. 10_1, 10_2)
for i in $(find emfs/cleaned/* -type d ); do
# get chromosome
chr=$(echo $i | cut -f3 -d"/" ) ;
# set script
script=mikk_genome/code/scripts/20201015_consolidate_chr_dat.R ;
# set output directory
out_dir=emfs/consolidated_with_rc ;
# consolidate into one file
bsub \
-M 30000 \
-o log/20201015_consol_$chr.out \
-e log/20201015_consol_$chr.err \
"Rscript --vanilla $script $i $out_dir/$chr.txt " ;
done
# combine for each chr
mkdir emfs/final_with_rc
for i in $(find emfs/consolidated_with_rc/* | cut -f3 -d"/" | cut -f1 -d"_" | sort | uniq ); do
script=mikk_genome/code/scripts/20201015_consolidate_chr_files.R ;
in_dir=emfs/consolidated_with_rc ;
out_dir=emfs/final_with_rc ;
bsub \
-M 50000 \
-o log/20201019_full_chr_$i.out \
-e log/20201019_full_chr_$i.err \
"Rscript --vanilla $script $i $in_dir $out_dir ";
done
# combine full, A and B MAF datasets
mkdir maf/20201015_split-by-chr_all
for i in $(seq 1 24 ); do
printf "chr_pos\tchr\tpos\tref\talt\tmikk\tmikk_a\tmikk_b\n" \
> maf/20201015_split-by-chr_all/$i.txt ;
cut -f6 maf/20200915_split-by-chr_1/$i.txt > tmpa.txt ;
cut -f6 maf/20200915_split-by-chr_2/$i.txt > tmpb.txt ;
paste maf/20200910_split-by-chr/$i.txt \
tmpa.txt \
tmpb.txt \
>> maf/20201015_split-by-chr_all/$i.txt ;
rm tmpa.txt tmpb.txt;
done
# bind MAF and EMF datasets
mkdir abba_baba_mikk/20201015_freq_tables
for i in $( seq 1 24 ); do
dat_file=$( echo emfs/final_with_rc/$i.txt ) ;
af_file=$( echo maf/20201015_split-by-chr_all/$i.txt ) ;
dir_out=abba_baba_mikk/20201015_freq_tables ;
script=mikk_genome/code/scripts/20201017_combine_final_and_af_data.R ;
bsub \
-M 30000 \
-o log/20201017_frq_tbl_$i.out \
-e log/20201017_frq_tbl_$i.err \
"Rscript --vanilla $script $i $dat_file $af_file $dir_out " ;
done
# split by individuals, take non-missing, biallelic SNPs, and add AF all at once
mkdir vcfs/20201014_indiv-split_non-missing_biallelic-snps
for i in $(bcftools query -l vcfs/full-run_line-ids.vcf.gz ); do
input_vcf=vcfs/full-run_line-ids.vcf.gz \
output_dir=vcfs/20201014_indiv-split_non-missing_biallelic-snps \
bsub \
-o log/20201014_indiv_split_$i.out \
-e log/20201014_indiv_split_$i.err \
"bcftools view \
--samples $i \
--output-type u \
$input_vcf | \
bcftools view \
--min-alleles 2 \
--max-alleles 2 \
--types snps \
--genotype ^miss \
--output-type u | \
bcftools +fill-tags \
--output-type z \
--output $output_dir/$i.vcf.gz \
-- \
--tags AF";
done
# Do again for sub-populations
mkdir vcfs/20201014_subpops_non-missing_biallelic-snps
output_dir=vcfs/20201014_subpops_non-missing_biallelic-snps
## KW (wild Kiyosu)
pop_name=kiyosu_wild
bcftools view \
--samples KW_1,KW_2,KW_3,KW_4,KW_5,KW_6,KW_7 \
--output-type u \
$input_vcf | \
bcftools view \
--min-alleles 2 \
--max-alleles 2 \
--types snps \
--genotype ^miss \
--output-type u | \
bcftools +fill-tags \
--output-type z \
--output $output_dir/$pop_name.vcf.gz \
-- \
--tags AF
## iCab
pop_name=icab
bcftools view \
--samples iCab_F24,iCab_2F2,iCab_HD,iCab_F25 \
--output-type u \
$input_vcf | \
bcftools view \
--min-alleles 2 \
--max-alleles 2 \
--types snps \
--genotype ^miss \
--output-type u | \
bcftools +fill-tags \
--output-type z \
--output $output_dir/$pop_name.vcf.gz \
-- \
--tags AF
## HO5
pop_name=ho5
bcftools view \
--samples Ho5_1,Ho5_2 \
--output-type u \
$input_vcf | \
bcftools view \
--min-alleles 2 \
--max-alleles 2 \
--types snps \
--genotype ^miss \
--output-type u | \
bcftools +fill-tags \
--output-type z \
--output $output_dir/$pop_name.vcf.gz \
-- \
--tags AF
# For individuals
mkdir maf/20201014_indiv_split
for i in $( find vcfs/20201014_indiv-split_non-missing_biallelic-snps/* ); do
name=$( basename $i | cut -f1 -d"." );
output_dir=maf/20201014_indiv_split;
bsub \
-o log/20201014_get_maf_$i.out \
-e log/20201014_get_maf_$i.err \
"bcftools query \
--format '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AF\n' \
--output-file $output_dir/$name.txt \
$i";
done
# Run R script over them to add chr_pos column, and pop_a and pop_b AF columns
mkdir maf/20201014_indiv_split_with_ab
for i in $( find maf/20201014_indiv_split/* ); do
sample=$( basename $i | cut -f1 -d"." );
script=mikk_genome/code/scripts/20201014_tidy_indiv_maf_files.R ;
file_in=$i ;
file_out=maf/20201014_indiv_split_with_ab/$sample.txt ;
bsub \
-M 20000 \
-o log/20201014_tidy_maf_$sample.out \
-e log/20201014_tidy_maf_$sample.err \
"Rscript --vanilla $script $file_in $file_out ";
done
input_vcf=vcfs/full-run_line-ids.vcf.gz
output_dir=vcfs/20201014_subpops_non-missing_biallelic-snps
# KW (wild Kiyosu)
pop_name=kiyosu_wild
## KW_a
bcftools view \
--samples KW_1,KW_2,KW_3,KW_4 \
--output-type u \
$input_vcf | \
bcftools view \
--min-alleles 2 \
--max-alleles 2 \
--types snps \
--genotype ^miss \
--output-type u | \
bcftools +fill-tags \
--output-type z \
--output $output_dir/$pop_name\_a.vcf.gz \
-- \
--tags AF
## KW_b
bcftools view \
--samples KW_5,KW_6,KW_7 \
--output-type u \
$input_vcf | \
bcftools view \
--min-alleles 2 \
--max-alleles 2 \
--types snps \
--genotype ^miss \
--output-type u | \
bcftools +fill-tags \
--output-type z \
--output $output_dir/$pop_name\_b.vcf.gz \
-- \
--tags AF
# iCab
pop_name=icab
## a
bcftools view \
--samples iCab_F24,iCab_2F2 \
--output-type u \
$input_vcf | \
bcftools view \
--min-alleles 2 \
--max-alleles 2 \
--types snps \
--genotype ^miss \
--output-type u | \
bcftools +fill-tags \
--output-type z \
--output $output_dir/$pop_name\_a.vcf.gz \
-- \
--tags AF
## b
bcftools view \
--samples iCab_HD,iCab_F25 \
--output-type u \
$input_vcf | \
bcftools view \
--min-alleles 2 \
--max-alleles 2 \
--types snps \
--genotype ^miss \
--output-type u | \
bcftools +fill-tags \
--output-type z \
--output $output_dir/$pop_name\_b.vcf.gz \
-- \
--tags AF
# HO5
pop_name=ho5
## Just copy over the individual ones
cp vcfs/20201014_indiv-split_non-missing_biallelic-snps/Ho5_1.vcf.gz $output_dir/$pop_name\_a.vcf.gz
cp vcfs/20201014_indiv-split_non-missing_biallelic-snps/Ho5_2.vcf.gz $output_dir/$pop_name\_b.vcf.gz
# Create MAF tables
mkdir maf/20201015_subpops
for i in $( find vcfs/20201014_subpops_non-missing_biallelic-snps/* ); do
name=$( basename $i | cut -f1 -d"." );
output_dir=maf/20201015_subpops;
bsub \
-o log/20201015_get_maf_$name.out \
-e log/20201015_get_maf_$name.err \
"bcftools query \
--format '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/AF\n' \
--output-file $output_dir/$name.txt \
$i";
done
# Split by chr
mkdir maf/20201017_subpops_split-by-chr
for i in $( find maf/20201015_subpops/* ) ; do
pop_name=$( basename $i | sed 's/.txt//g' ) ;
new_path=maf/20201017_subpops_split-by-chr/$pop_name ;
if [ ! -d "$new_path" ] ; then
mkdir $new_path ;
fi ;
for j in $(cut -f2 $i | sort | uniq ) ; do
awk "\$2 == $j" $i \
> $new_path/$j.txt ;
done ;
done
ONT samples (n = 9): mikk_genome/data/20201017_ont_stable_samples.txt
4_1 4_2 7_2 11_1 69_1 79_2 80_1 117_2 134_1
# Split ONT samples by chr
mkdir maf/20201014_indiv_split_with_ab_split-by-chr
cat mikk_genome/data/20201019_ont_samples.txt
for i in $( echo 7_1 131_1 134_2 ) ; do
input_file=maf/20201014_indiv_split_with_ab/$i.txt ;
new_path=maf/20201014_indiv_split_with_ab_split-by-chr/$i ;
if [ ! -d "$new_path" ] ; then
mkdir $new_path ;
fi ;
for j in $(cut -f2 $input_file | sort | uniq ) ; do
awk "\$2 == $j" $input_file \
> $new_path/$j.txt ;
done ;
done
# Combine ONT samples
mkdir maf/20201017_ont_samples_split-by-chr
for i in $( seq 1 24 ) ; do
chr=$i ;
in_dir=maf/20201014_indiv_split_with_ab_split-by-chr ;
out_dir=maf/20201017_ont_samples_split-by-chr ;
script=mikk_genome/code/scripts/20201017_combine_ont_sample_mafs.R ;
bsub \
-M 30000 \
-o log/20201019_combine-ont_$chr.out \
-e log/20201019_combine-ont_$chr.err \
"Rscript --vanilla $script $chr $in_dir $out_dir ";
done
# Combine subpop files
mkdir maf/20201017_subpops_split-by-chr_all
for i in $( seq 1 24 ) ; do
chr=$i ;
in_dir=maf/20201017_subpops_split-by-chr ;
out_dir=maf/20201017_subpops_split-by-chr_all ;
script=mikk_genome/code/scripts/20201017_combine_subpop_sample_mafs.R ;
bsub \
-M 30000 \
-o log/20201019_combine-subp_$chr.out \
-e log/20201019_combine-subp_$chr.err \
"Rscript --vanilla $script $chr $in_dir $out_dir ";
done
# Combine MIKK total, ONT, and subpop MAF tables
mkdir maf/20201019_all
for i in $( seq 1 24 ) ; do
script=mikk_genome/code/scripts/20201017_combine_final_and_af_data.R ;
chr=$i ;
in_file_mik=maf/20201015_split-by-chr_all/$chr.txt ;
in_file_sub=maf/20201017_subpops_split-by-chr_all/$chr.txt ;
in_file_ont=maf/20201017_ont_samples_split-by-chr/$chr.txt ;
in_file_emf=emfs/final_with_rc/$chr.txt ;
out_dir=maf/20201019_all ;
bsub \
-M 50000 \
-o log/20201020_frq_tbl_$i.out \
-e log/20201020_frq_tbl_$i.err \
"Rscript --vanilla $script $chr $in_file_mik $in_file_sub $in_file_ont $in_file_emf $out_dir " ;
done
# Create dir for final output
mkdir abba_baba_mikk/20201020_freq_tables
Combine chr files into single file - run directly on cluster
files <- list.files("maf/20201019_all",
pattern = "[0-9]",
full.names = T)
# read in data
dat_list <- lapply(files, function(x){
# read in data
df <- read.table(x,
header = T,
sep = "\t",
as.is = T,
check.names = F)
return(df)
})
# bind together and sort
final_df <- dplyr::bind_rows(dat_list) %>% # bind into single DF
dplyr::arrange(chr, pos) # sort by chromosome, then position
# write table
write.table(final_df, "abba_baba_mikk/20201020_freq_tables/all.txt", quote = F, sep = "\t", row.names = F)
# Run ABBA BABA
mkdir abba_baba_mikk/20201020_rlists
# get string of samples
ont_samples=$( cat mikk_genome/data/20201019_ont_samples.txt | tr "\n" " ")
for i in $( echo melastigma javanicus ); do
for j in $( echo hdrr hni hsok ho5 icab kiyosu_wild mikk ); do
for k in $( echo mikk ho5 icab kiyosu_wild $ont_samples ) ; do
# don't run if they are the same
if [ $j != $k ] ; then
script=mikk_genome/code/scripts/20201020_run_abbababa.R ;
p1=$i ;
p2=$j ;
p3=$k ;
dat_file=$( echo abba_baba_mikk/20201020_freq_tables/all.txt ) ;
dir_out=$( echo abba_baba_mikk/20201020_rlists ) ;
bsub \
-M 30000 \
-o log/20201020_abbababa_$p1\_$p2\_$p3.out \
-e log/20201020_abbababa_$p1\_$p2\_$p3.err \
"Rscript --vanilla $script $p1 $p2 $p3 $dat_file $dir_out "
fi ;
done ;
done ;
done
library(tidyverse)
# get list of target files
files <- list.files("abba_baba_mikk/20201020_rlists",
full.names = T)
# read into one list
final_lst = lapply(files, readRDS)
# Reduce down to DF
final_df <- lapply(final_lst, function(file){
# Get D stats
g_wide_d <- file[["Genome-wide D"]][["D statistic"]]
per_chr_d <- sapply(file[["Per-chromosome"]],
function(x) x[["D statistic"]])
# Get Z scores
g_wide_z <- file[["Genome-wide D"]][["Z score"]]
per_chr_z <- sapply(file[["Per-chromosome"]],
function(x) x[["Z score"]])
# Get admixture
admix_f <- file[["Admixture"]][["f statistic"]]
per_chr_f <- sapply(file[["Per-chromosome"]],
function(x) x[["Admixture"]][["f statistic"]])
# Get confidence intervals
g_wide_ci_lower <- file[["Admixture"]][["Confidence interval"]][["lower"]]
g_wide_ci_upper <- file[["Admixture"]][["Confidence interval"]][["upper"]]
per_chr_ci_lower <- sapply(file[["Per-chromosome"]],
function(x) x[["Admixture"]][["Confidence interval"]][["lower"]])
per_chr_ci_upper <- sapply(file[["Per-chromosome"]],
function(x) x[["Admixture"]][["Confidence interval"]][["upper"]])
# Create data frame
df_out <- data.frame("p1" = file[["Populations"]][["P1"]],
"p2" = file[["Populations"]][["P2"]],
"p3" = file[["Populations"]][["P3"]],
"chr" = c("all", names(per_chr_d)),
"d_stat" = c(g_wide_d, per_chr_d),
"z_score" = c(g_wide_z, per_chr_z),
"admix_f" = c(admix_f, per_chr_f),
"f_ci_lower" = c(g_wide_ci_lower, per_chr_ci_lower),
"f_ci_upper" = c(g_wide_ci_upper, per_chr_ci_upper),
stringsAsFactors = F)
return(df_out)
}) %>%
dplyr::bind_rows()
# capitalise species / line names / colnames
rename_vec = c("HdrR", "HNI", "HO5", "HSOK", "iCab", "KW", "MIKK")
names(rename_vec) = c("hdrr", "hni", "ho5", "hsok", "icab", "kiyosu_wild", "mikk")
final_df = final_df %>%
dplyr::mutate(dplyr::across(c("p2", "p3"),
~dplyr::recode(.x, !!!rename_vec)))
colnames(final_df) = dplyr::recode(colnames(final_df),
p1 = "P1",
p2 = "P2",
p3 = "P3")
# save
write.table(final_df, "mikk_genome/data/20201020_abbababa_final.txt",
quote = F, sep = "\t", row.names = F)
# Read in data
final_df <- read.table(here("data", "20201020_abbababa_final.txt"),
header = T,
sep = "\t",
as.is = T)
# Set colours
cols <- c("#F3B61F", "#631E68", "#F6673A", "#F33A56", "#55B6B0", "#08605F", "#002642", "#B02156")
names(cols) <- c("HdrR", "HSOK", "HNI", "melastigma", "javanicus", "KW", "HO5", "iCab")
# Factorise chr, P1 and P2 to get them in the right order
chr_order <- c(seq(1,24), "all")
fish_order <- c("MIKK", "KW", "HO5", "iCab", "HdrR", "HSOK", "HNI")
final_df <- final_df %>%
dplyr::mutate(across(P2,
~factor(.x, levels = fish_order))) %>%
dplyr::mutate(chr = factor(chr, levels = chr_order))
NOTE: There are 142 rows with an admix_f value over 1?
final_df %>%
dplyr::filter(admix_f > 1) %>%
dplyr::count(P2, P3)
melastigma and javanicuscor_df <- final_df %>%
# filter for when P1 is another Oryzias, and P2
dplyr::filter(P1 %in% c("javanicus", "melastigma") & P2 != "MIKK" & P3 == "MIKK") %>%
# pivot to put the admixture_f stat for melastigma and javanicus in the same row
tidyr::pivot_wider(id_cols = c("P2", "chr"),
names_from = P1,
values_from = c(admix_f, f_ci_lower, f_ci_upper))
cor_df$chr <- as.character(cor_df$chr)
cor_df$chr <- ifelse(cor_df$chr == "all", "genome-wide", cor_df$chr)
chr_order_plot <- c(seq(1,24), "genome-wide")
cor_df$chr <- factor(cor_df$chr, levels = chr_order_plot)
cor_df_means <- cor_df %>%
# apply across rows
dplyr::rowwise() %>%
# get means for f and CIs
dplyr::mutate(mean_f = mean(c(admix_f_javanicus, admix_f_melastigma)),
mean_ci_upper = mean(c(f_ci_upper_javanicus, f_ci_upper_melastigma)),
mean_ci_lower = mean(c(f_ci_lower_javanicus, f_ci_lower_melastigma))) %>%
# set stats at a maximum of 1
dplyr::mutate(across(c("mean_f", "mean_ci_upper", "mean_ci_lower"),
~dplyr::if_else(.x > 1,
1,
.x)))
cor_df_means %>%
ggplot(aes(P2, mean_f, fill = P2)) +
geom_col() +
geom_errorbar(aes(ymin = mean_ci_lower,
ymax = mean_ci_upper),
position = position_dodge(0.9),
width = 0.25) +
guides(fill = F) +
facet_wrap(~chr) +
ylim(0,1) +
ylab(expression(paste("Mean ", italic("f"), " statistic"))) +
theme_bw(base_size = 8) +
theme(panel.grid = element_blank()) +
scale_fill_manual(values = cols)
ggsave(filename = paste("20201020_f_stat_by_chr_w-KW-icab-ho5", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200921_introgression/",
width = 24.75,
height = 19.5,
units = "cm",
dpi = 500)
cor_df_means %>%
dplyr::filter(P2 %in% c("HSOK", "HNI", "HdrR")) %>%
ggplot(aes(P2, mean_f, fill = P2)) +
geom_col() +
geom_errorbar(aes(ymin = mean_ci_lower,
ymax = mean_ci_upper),
position = position_dodge(0.9),
width = 0.25) +
guides(fill = F) +
facet_wrap(~chr) +
ylim(0,1) +
ylab(expression(paste("Mean ", italic("f"), " statistic"))) +
theme_bw() +
theme(panel.grid = element_blank()) +
scale_fill_manual(values = cols)
ggsave(filename = paste("20201020_f_stat_by_chr", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200921_introgression/",
width = 24.75,
height = 19.5,
units = "cm",
dpi = 500)
# get vector of ONT samples
ont_samples <- read.table(here("data", "20201019_ont_samples.txt"))[, 1]
# plot separate panel for each of the ONT samples
lapply(ont_samples, function(x){
# create DF
cor_df <- final_df %>%
# filter for when P1 is another Oryzias, and P2
dplyr::filter(P1 %in% c("javanicus", "melastigma") & P2 %in% c("HdrR", "HNI", "HSOK") & P3 == x) %>%
# pivot to put the admixture_f stat for melastigma and javanicus in the same row
tidyr::pivot_wider(id_cols = c("P2", "chr"),
names_from = P1,
values_from = c(admix_f, f_ci_lower, f_ci_upper)) %>%
# apply across rows
dplyr::rowwise() %>%
# get means for f and CIs
dplyr::mutate(mean_f = mean(c(admix_f_javanicus, admix_f_melastigma)),
mean_ci_upper = mean(c(f_ci_upper_javanicus, f_ci_upper_melastigma)),
mean_ci_lower = mean(c(f_ci_lower_javanicus, f_ci_lower_melastigma)))
# set stats at a maximum of 1
# dplyr::mutate(across(c("mean_f", "mean_ci_upper", "mean_ci_lower"),
# ~dplyr::if_else(.x > 1,
# 1,
# .x)))
# Create plot
out_plot <- cor_df %>%
ggplot(aes(P2, mean_f, fill = P2)) +
geom_col() +
geom_errorbar(aes(ymin = mean_ci_lower,
ymax = mean_ci_upper),
position = position_dodge(0.9),
width = 0.25) +
guides(fill = F) +
facet_wrap(~chr) +
#ylim(0,1) +
ylab(expression(paste("Mean ", italic("f"), " statistic"))) +
theme_bw() +
theme(panel.grid = element_blank()) +
scale_fill_manual(values = cols) +
ggtitle(x)
# Save
ggsave(filename = paste("20201020_", x, ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200921_introgression/20201020_ont_samples/",
width = 24.75,
height = 19.5,
units = "cm",
dpi = 500)
})
mkdir geno
# Pull out genotypes
bcftools view \
--min-alleles 2 \
--max-alleles 2 \
--types snps \
--output-type u \
vcfs/full-run_line-ids.vcf |\
bcftools query \
--format '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%TGT]\n' \
--output geno/20201020_full_vcf.geno
## replace all . with N
sed 's/\./N/g' geno/20201020_full_vcf.geno > geno/20201021_full_vcf_bcftools.geno
# use Simon Martin's parseVCF.py script
python abbababba_tutorial/genomics_general-master/VCF_processing/parseVCF.py \
-i vcfs/full-run_line-ids.vcf \
--skipIndels \
--minQual 30 \
> geno/20201021_full_vcf.geno
# Doesn't remove indels.
# Split by chr
mkdir geno/20201021_full_vcf_bcftools_split-by-chr
for i in $( seq 1 24 ) ; do
script=mikk_genome/code/scripts/20201021_split_geno_by_chr.R ;
chr=$i ;
in_file=geno/20201021_full_vcf_bcftools.geno ;
out_file=geno/20201021_full_vcf_bcftools_split-by-chr/$chr.geno ;
bsub \
-M 50000 \
-o log/20201021_split_geno_by_chr_$i.out \
-e log/20201021_split_geno_by_chr_$i.err \
"Rscript --vanilla $script $chr $in_file $out_file " ;
done
.geno format and bind to VCF genomkdir geno/20201021_final
# make samples file
bcftools query -l vcfs/full-run_line-ids.vcf \
> mikk_genome/data/20201021_full-run_line-ids_samples.txt
for i in $( seq 1 24 ) ; do
script=mikk_genome/code/scripts/20201021_emf_to_geno.R ;
chr=$i ;
emf_in_file=emfs/final_with_rc/$chr.txt ;
vcf_gen_file=geno/20201021_full_vcf_bcftools_split-by-chr/$chr.geno ;
vcf_samples=mikk_genome/data/20201021_full-run_line-ids_samples.txt ;
out_file=geno/20201021_final/$chr.txt ;
bsub \
-M 50000 \
-o log/20201021_create_final_geno_$chr.out \
-e log/20201021_create_final_geno_$chr.err \
"Rscript --vanilla $script $emf_in_file $vcf_gen_file $vcf_samples $out_file " ;
done
# make no-sibs version
mkdir geno/20201022_final_no-sibs/
for i in $( seq 1 24 ) ; do
script=mikk_genome/code/scripts/20201022_remove_sibs.R ;
chr=$i ;
in_file=geno/20201021_final/$chr.txt ;
sibs_file=mikk_genome/data/20200227_panel_lines_excluded.txt ;
out_file=geno/20201022_final_no-sibs/$chr.txt ;
bsub \
-M 20000 \
-o log/20201022_tidy_geno_$chr.out \
-e log/20201022_tidy_geno_$chr.err \
"Rscript --vanilla $script $in_file $sibs_file $out_file " ;
done
Combine into single file
files <- list.files("geno/20201021_final",
pattern = "[0-9]",
full.names = T)
# read in data
dat_list <- lapply(files, function(x){
# read in data
df <- read.table(x,
header = T,
sep = "\t",
as.is = T,
comment.char = "*",
check.names = F)
return(df)
})
# bind together and sort
final_df <- dplyr::bind_rows(dat_list) %>% # bind into single DF
dplyr::arrange("#CHROM", "POS") # sort by chromosome, then position
# write table
write.table(final_df, "geno/20201021_final/all.txt",
quote = F,
sep = "\t",
row.names = F)
# create a no-sibs version
## create vector of siblines to be excluded
excl_sibs = scan("mikk_genome/data/20200227_panel_lines_excluded.txt", character())
## remove columns from final_df
final_df_nosibs = final_df %>%
dplyr::select(-all_of(excl_sibs))
## write table
write.table(final_df_nosibs, "geno/20201021_final/all_nosibs.txt",
quote = F,
sep = "\t",
row.names = F)
pop.txt file directly on clusteremf_samples = c("hdrr", "hni", "hsok", "javanicus", "melastigma", "ancestor")
vcf_samples = scan("mikk_genome/data/20201021_full-run_line-ids_samples.txt", character())
pop_out = data.frame("samples" = c(emf_samples, vcf_samples),stringsAsFactors = F)
pop_out$population = ifelse(pop_out$samples %in% emf_samples,
pop_out$samples,
ifelse(grepl("iCab", pop_out$samples),
"icab",
ifelse(grepl("Ho5", pop_out$samples),
"ho5",
ifelse(grepl("KW", pop_out$samples),
"kiyosu_wild",
"mikk"))))
write.table(pop_out,
"mikk_genome/data/20201021_abba.pop.txt",
quote = F, sep = "\t", row.names = F, col.names = F)
mkdir abba_baba_mikk/20201022_sliding_windows
# TEST
python abbababba_tutorial/genomics_general-master/ABBABABAwindows.py \
-g geno/20201022_final_no-sibs/1.txt \
-f phased \
-o abba_baba_mikk/20201022_sliding_windows/1.txt \
-P1 melastigma \
-P2 hsok \
-P3 mikk \
-O ancestor \
--popsFile mikk_genome/data/20201022_abba_nosibs.pop.txt \
-w 25000 \
-m 250
# Works!
# TRUE
for i in $( echo melastigma javanicus ) ; do
for j in $( echo hdrr hni hsok ) ; do
for k in $( seq 1 24 ) ; do
script=abbababba_tutorial/genomics_general-master/ABBABABAwindows.py
p1=$i ;
p2=$j ;
chr=$k ;
popsfile=mikk_genome/data/20201022_abba_nosibs.pop.txt ;
bsub \
-M 20000 \
-o log/20201022_abba_sliding_$p1\_$p2\_$chr.out \
-e log/20201022_abba_sliding_$p1\_$p2\_$chr.err \
"python $script \
-g geno/20201022_final_no-sibs/$chr.txt \
-f phased \
-o abba_baba_mikk/20201022_sliding_windows/$p1\_$p2\_$chr.txt \
-P1 $p1 \
-P2 $p2 \
-P3 mikk \
-O ancestor \
--popsFile $popsfile \
-w 25000 \
-m 250"
done ;
done ;
done
library(tidyverse)
# get full paths of target files
files = list.files("abba_baba_mikk/20201022_sliding_windows",
full.names = T)
basename(files) %>%
stringr::str_split("_", simplify = T)
# read in files
df = lapply(files, function(x){
# read in data
out_df = read.csv(x)
# get populations
split_name = basename(x) %>%
stringr::str_split("_", simplify = T)
p1 = split_name[, 1]
p2 = split_name[, 2]
# add to data
out_df$p1 = p1
out_df$p2 = p2
return(out_df)
}) %>%
# bind into DF
dplyr::bind_rows()
# write file
write.table(df,
"mikk_genome/data/20201022_abba_sliding_windows.txt",
quote = F,
sep = "\t",
row.names = F)
# Read in data
df = read.table(here("data", "20201022_abba_sliding_windows.txt"), header = T, sep = "\t", as.is = T)
# Convert fd to 0 if D < 0
df$fd = ifelse(df$D < 0,
0,
df$fd)
# Change names
df = df %>%
dplyr::mutate(p2 = recode(df$p2, hdrr = "HdrR", hni = "HNI", hsok = "HSOK"))
# Set colours
cols <- c("#F3B61F", "#631E68", "#F6673A", "#F33A56", "#55B6B0", "#08605F", "#002642", "#B02156")
names(cols) <- c("HdrR", "HSOK", "HNI", "melastigma", "javanicus", "KW", "HO5", "iCab")
## Factorise chr, P1 and P2 to get them in the right order
#chr_order <- c(seq(1,24), "all")
#fish_order <- c("MIKK", "KW", "HO5", "iCab", "HdrR", "HSOK", "HNI")
#
#final_df <- final_df %>%
# dplyr::mutate(across(P2,
# ~factor(.x, levels = fish_order))) %>%
# dplyr::mutate(chr = factor(chr, levels = chr_order))
df %>%
dplyr::filter(p1 == "melastigma") %>%
ggplot() +
geom_line(aes(mid, fd, colour = p2)) +
facet_wrap(~scaffold, nrow = 24, ncol = 1) +
scale_colour_manual(values = cols) +
theme_bw(base_size = 10) +
scale_x_continuous(breaks = c(0, 5000000, 10000000, 15000000, 20000000, 25000000, 30000000, 35000000),
labels = scales::comma) +
xlab("Base position") +
ylab(bquote(italic(f[d]))) +
labs(colour = "P2")
ggsave(filename = paste("20201022_fd_by_chr", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200921_introgression/",
width = 24.75,
height = 50,
units = "cm",
dpi = 300)
karyoploteRif (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("karyoploteR")
library(karyoploteR)
BSgenome::available.genomes()
No medaka. Need to make custom.
# Get chromosome lengths
med_chr_lens = read.table(here("data", "Oryzias_latipes.ASM223467v1.dna.toplevel.fa_chr_counts.txt"),
col.names = c("chr", "end"))
# Add start
med_chr_lens$start = 1
# Reorder
med_chr_lens = med_chr_lens %>%
dplyr::select(chr, start, end)
# Create custom genome
med_genome = regioneR::toGRanges(med_chr_lens)
# Create plot
kp = karyoploteR::plotKaryotype(genome = med_genome)
# filter for melastigma
df_kp = df %>%
dplyr::filter(p1 == "melastigma")
# make chr and fd numeric
df_kp$scaffold <- as.numeric(df_kp$scaffold)
# Pull out HdrR, HNI and HSOK columns
mkdir geno/20201023_abba_sw_granges
for i in $( seq 1 24 ) ; do
chr=$i ;
in_file=geno/20201022_final_no-sibs/$chr.txt ;
out_file=geno/20201023_abba_sw_granges/$chr.txt ;
awk '{print $1,$2,$3,$4,$5}' $in_file \
> $out_file ;
done
# Create final output directory
mkdir geno/20201023_abba_sw_granges/final
library(tidyverse)
# Read files into R
files = list.files("geno/20201023_abba_sw_granges", full.names = T)
df = lapply(files, function(x){
out = read.table(x,
header = T,
as.is = T,
comment.char = "*",
check.names = F)
return(out)
}) %>%
dplyr::bind_rows() %>%
dplyr::select(chr = "#CHROM",
pos = "POS",
everything()) %>%
dplyr::mutate(dplyr::across(hdrr:hsok,
~ifelse(.x == "N/N", NA, 1)))
write.table(df,
file = "geno/20201023_abba_sw_granges_final/20201023_hdrr_hni_hsok.txt",
quote = F,
sep = "\t",
row.names = F)
# Pull to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/geno/20201023_abba_sw_granges_final/20201023_hdrr_hni_hsok.txt ~/Documents/Data/20201012_mikk_genome
# Make version with just HNI and HSOK
awk -v OFS="\t" '{print $1, $2, $4, $5}' \
geno/20201023_abba_sw_granges_final/20201023_hdrr_hni_hsok.txt \
> geno/20201023_abba_sw_granges_final/20201023_hni_hsok.txt
# Read in file on local
ol_ranges_df = read.table("~/Documents/Data/20201012_mikk_genome/20201023_hdrr_hni_hsok.txt",
header = T,
sep = "\t",
as.is = T)
ol_ranges_df_long = ol_ranges_df %>%
tidyr::pivot_longer(cols = c(hdrr, hni, hsok), names_to = "line", values_to = "present")
ol_ranges_list = split(ol_ranges_df_long, f = ol_ranges_df_long$line)
ol_ranges_list = lapply(ol_ranges_list, function(x){
# remove NAs
df = x %>%
tidyr::drop_na(present)
# convert to GRanges object
ol_ranges = GenomicRanges::makeGRangesFromDataFrame(df,
ignore.strand = T,
seqnames.field = "chr",
start.field = "pos",
end.field = "pos")
return(ol_ranges)
})
# Convert to GRanges object
library(GenomicRanges)
ol_ranges = GenomicRanges::makeGRangesFromDataFrame(ol_ranges_df,
ignore.strand = T,
seqnames.field = "chr",
start.field = "pos",
end.field = "pos")
bcftools view \
--min-alleles 2 \
--max-alleles 2 \
--types snps \
--output-type u \
vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz |\
bcftools query \
--format '%CHROM\t%POS\n' \
--output geno/20201023_abba_sw_granges_final/20201023_mikk.txt
gzip -k geno/20201023_abba_sw_granges_final/20201023_mikk.txt
# Pull to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/geno/20201023_abba_sw_granges_final/20201023_mikk.txt.gz ~/Documents/Data/20201012_mikk_genome
saveRDS(mikk_ranges, "~/Desktop")
cannot open file '/Users/brettell/Desktop': it is a directoryError in gzfile(file, mode) : cannot open the connection
# Get list of exons from biomaRt
library("biomaRt")
# Select dataset
olat_mart = useEnsembl(biomart = "ensembl", dataset = "olatipes_gene_ensembl")
# Get attributes of interest (exon ID, chr, start, end)
exons <- getBM(attributes = c("chromosome_name", "ensembl_gene_id", "ensembl_transcript_id", "transcript_start", "transcript_end", "transcript_length", "ensembl_exon_id", "rank", "strand", "exon_chrom_start", "exon_chrom_end", "cds_start", "cds_end"),
mart = olat_mart)
# Convert exons to GRanges
ex_ranges = GenomicRanges::makeGRangesFromDataFrame(exons,
ignore.strand = T,
seqnames.field = "chromosome_name",
start.field = "exon_chrom_start",
end.field = "exon_chrom_end")
# Plot
kp <- plotKaryotype(med_genome, chromosomes = "2")
# Add base numbers
karyoploteR::kpAddBaseNumbers(kp, tick.dist = 5000000, cex = 0.3)
# Add data backgrounds
karyoploteR::kpDataBackground(kp, r0=0, r1 = 1, color = "white")
# Add axis label
kpAxis(kp, r0=0.6, r1 = 1, cex = 0.4)
# Add fd data
karyoploteR::kpLines(kp,
chr = df_kp$scaffold[df_kp$p2 == "HNI"],
x = df_kp$mid[df_kp$p2 == "HNI"],
y = df_kp$fd[df_kp$p2 == "HNI"],
col = "#F6673A",
r0=0.6, r1 = 1)
karyoploteR::kpLines(kp,
chr = df_kp$scaffold[df_kp$p2 == "HdrR"],
x = df_kp$mid[df_kp$p2 == "HdrR"],
y = df_kp$fd[df_kp$p2 == "HdrR"],
col = "#F3B61F",
r0=0.6, r1 = 1)
karyoploteR::kpLines(kp,
chr = df_kp$scaffold[df_kp$p2 == "HSOK"],
x = df_kp$mid[df_kp$p2 == "HSOK"],
y = df_kp$fd[df_kp$p2 == "HSOK"],
col = "#631E68",
r0=0.6, r1 = 1)
# Add SNP density data
kpPlotDensity(kp, data=mikk_ranges, col = "#49A379",
r0=0, r1=0.15,
window.size = 25000)
kpPlotDensity(kp, data=ol_ranges_list$hni, col = "#F6673A",
r0=0.15, r1=0.3,
window.size = 25000)
kpPlotDensity(kp, data=ol_ranges_list$hsok, col = "#631E68",
r0=0.3, r1=0.45,
window.size = 25000)
kpPlotDensity(kp, data=ol_ranges_list$hdrr, col = "#F3B61F",
r0=0.45, r1=0.6,
window.size = 25000)
# Add exon density to ideogram
kpPlotDensity(kp, data=ex_ranges, col = "#f77cb5",
data.panel = "ideogram",
window.size = 25000,
r0 = 0.5, r1 = 1)
kpPlotDensity(kp, data=ex_ranges, col = "#f77cb5",
data.panel = "ideogram",
window.size = 25000,
r0 = 0.5, r1 = 0)
# Add labels
kpAddLabels(kp, labels="MIKK",
r0=0, r1=0.15,
cex = 0.4)
kpAddLabels(kp, labels="HNI",
r0=0.15, r1=0.3,
cex = 0.4)
kpAddLabels(kp, labels="HSOK",
r0=0.3, r1=0.45,
cex = 0.4)
kpAddLabels(kp, labels="HdrR",
r0=0.45, r1=0.6,
cex = 0.4)
kpAddLabels(kp, labels=bquote(italic(f[d])),
r0=0.6, r1=1,
label.margin = 0.035,
cex = 0.6)
# Save
png(file="~/Documents/Docs/medaka pics/20200921_introgression/20201023_fd_with_density_chr2.png",
width=1279,
height=357,
units = "px",
res = 400)
# Plot
kp = plotKaryotype(med_genome, chromosomes = "2")
# Add base numbers
karyoploteR::kpAddBaseNumbers(kp, tick.dist = 5000000, cex = 0.3)
# Add data backgrounds
karyoploteR::kpDataBackground(kp, r0=0, r1 = 1, color = "white")
# Add axis label
kpAxis(kp, r0=0.6, r1 = 1, cex = 0.4)
# Add fd data
karyoploteR::kpLines(kp,
chr = df_kp$scaffold[df_kp$p2 == "HNI"],
x = df_kp$mid[df_kp$p2 == "HNI"],
y = df_kp$fd[df_kp$p2 == "HNI"],
col = "#F6673A",
r0=0.6, r1 = 1)
karyoploteR::kpLines(kp,
chr = df_kp$scaffold[df_kp$p2 == "HdrR"],
x = df_kp$mid[df_kp$p2 == "HdrR"],
y = df_kp$fd[df_kp$p2 == "HdrR"],
col = "#F3B61F",
r0=0.6, r1 = 1)
karyoploteR::kpLines(kp,
chr = df_kp$scaffold[df_kp$p2 == "HSOK"],
x = df_kp$mid[df_kp$p2 == "HSOK"],
y = df_kp$fd[df_kp$p2 == "HSOK"],
col = "#631E68",
r0=0.6, r1 = 1)
# Add SNP density data
kpPlotDensity(kp, data=mikk_ranges, col = "#49A379",
r0=0, r1=0.15,
window.size = 25000)
kpPlotDensity(kp, data=ol_ranges_list$hni, col = "#F6673A",
r0=0.15, r1=0.3,
window.size = 25000)
kpPlotDensity(kp, data=ol_ranges_list$hsok, col = "#631E68",
r0=0.3, r1=0.45,
window.size = 25000)
kpPlotDensity(kp, data=ol_ranges_list$hdrr, col = "#F3B61F",
r0=0.45, r1=0.6,
window.size = 25000)
# Add exon density to ideogram
kpPlotDensity(kp, data=ex_ranges, col = "#f77cb5",
data.panel = "ideogram",
window.size = 25000,
r0 = 0.5, r1 = 1)
kpPlotDensity(kp, data=ex_ranges, col = "#f77cb5",
data.panel = "ideogram",
window.size = 25000,
r0 = 0.5, r1 = 0)
# Add labels
kpAddLabels(kp, labels="MIKK",
r0=0, r1=0.15,
cex = 0.4)
kpAddLabels(kp, labels="HNI",
r0=0.15, r1=0.3,
cex = 0.4)
kpAddLabels(kp, labels="HSOK",
r0=0.3, r1=0.45,
cex = 0.4)
kpAddLabels(kp, labels="HdrR",
r0=0.45, r1=0.6,
cex = 0.4)
kpAddLabels(kp, labels=bquote(italic(f[d])),
r0=0.6, r1=1,
label.margin = 0.035,
cex = 0.6)
dev.off()
# Save
png(file="~/Documents/Docs/medaka\ pics/20200921_introgression/20201023_fd_with_density.png",
width=8500,
height=13500,
units = "px",
res = 400)
# Plot
kp = plotKaryotype(med_genome)
# Add base numbers
karyoploteR::kpAddBaseNumbers(kp, tick.dist = 5000000, cex = 0.3)
# Add data backgrounds
karyoploteR::kpDataBackground(kp, r0=0, r1 = 1, color = "white")
# Add axis label
kpAxis(kp, r0=0.6, r1 = 1, cex = 0.4)
# Add fd data
karyoploteR::kpLines(kp,
chr = df_kp$scaffold[df_kp$p2 == "HNI"],
x = df_kp$mid[df_kp$p2 == "HNI"],
y = df_kp$fd[df_kp$p2 == "HNI"],
col = "#F6673A",
r0=0.6, r1 = 1)
karyoploteR::kpLines(kp,
chr = df_kp$scaffold[df_kp$p2 == "HdrR"],
x = df_kp$mid[df_kp$p2 == "HdrR"],
y = df_kp$fd[df_kp$p2 == "HdrR"],
col = "#F3B61F",
r0=0.6, r1 = 1)
karyoploteR::kpLines(kp,
chr = df_kp$scaffold[df_kp$p2 == "HSOK"],
x = df_kp$mid[df_kp$p2 == "HSOK"],
y = df_kp$fd[df_kp$p2 == "HSOK"],
col = "#631E68",
r0=0.6, r1 = 1)
# Add SNP density data
kpPlotDensity(kp, data=mikk_ranges, col = "#49A379",
r0=0, r1=0.15,
window.size = 25000)
kpPlotDensity(kp, data=ol_ranges_list$hni, col = "#F6673A",
r0=0.15, r1=0.3,
window.size = 25000)
kpPlotDensity(kp, data=ol_ranges_list$hsok, col = "#631E68",
r0=0.3, r1=0.45,
window.size = 25000)
kpPlotDensity(kp, data=ol_ranges_list$hdrr, col = "#F3B61F",
r0=0.45, r1=0.6,
window.size = 25000)
# Add exon density to ideogram
kpPlotDensity(kp, data=ex_ranges, col = "#f77cb5",
data.panel = "ideogram",
window.size = 25000,
r0 = 0.5, r1 = 1)
kpPlotDensity(kp, data=ex_ranges, col = "#f77cb5",
data.panel = "ideogram",
window.size = 25000,
r0 = 0.5, r1 = 0)
# Add labels
kpAddLabels(kp, labels="MIKK",
r0=0, r1=0.15,
cex = 0.4)
kpAddLabels(kp, labels="HNI",
r0=0.15, r1=0.3,
cex = 0.4)
kpAddLabels(kp, labels="HSOK",
r0=0.3, r1=0.45,
cex = 0.4)
kpAddLabels(kp, labels="HdrR",
r0=0.45, r1=0.6,
cex = 0.4)
kpAddLabels(kp, labels=bquote(italic(f[d])),
r0=0.6, r1=1,
label.margin = 0.035,
cex = 0.6)
dev.off()
# Save
png(file="~/Documents/Docs/medaka\ pics/20200921_introgression/20201023_fd_with_density_chr2.png",
width=8500,
height=2372,
units = "px",
res = 400)
# Plot
kp = plotKaryotype(med_genome, chromosomes = "2")
# Add base numbers
karyoploteR::kpAddBaseNumbers(kp, tick.dist = 5000000, cex = 0.3)
# Add data backgrounds
karyoploteR::kpDataBackground(kp, r0=0, r1 = 1, color = "white")
# Add axis label
kpAxis(kp, r0=0.6, r1 = 1, cex = 0.4)
# Add fd data
karyoploteR::kpLines(kp,
chr = df_kp$scaffold[df_kp$p2 == "HNI"],
x = df_kp$mid[df_kp$p2 == "HNI"],
y = df_kp$fd[df_kp$p2 == "HNI"],
col = "#F6673A",
r0=0.6, r1 = 1)
karyoploteR::kpLines(kp,
chr = df_kp$scaffold[df_kp$p2 == "HdrR"],
x = df_kp$mid[df_kp$p2 == "HdrR"],
y = df_kp$fd[df_kp$p2 == "HdrR"],
col = "#F3B61F",
r0=0.6, r1 = 1)
karyoploteR::kpLines(kp,
chr = df_kp$scaffold[df_kp$p2 == "HSOK"],
x = df_kp$mid[df_kp$p2 == "HSOK"],
y = df_kp$fd[df_kp$p2 == "HSOK"],
col = "#631E68",
r0=0.6, r1 = 1)
# Add SNP density data
kpPlotDensity(kp, data=mikk_ranges, col = "#49A379",
r0=0, r1=0.15,
window.size = 25000)
kpPlotDensity(kp, data=ol_ranges_list$hni, col = "#F6673A",
r0=0.15, r1=0.3,
window.size = 25000)
kpPlotDensity(kp, data=ol_ranges_list$hsok, col = "#631E68",
r0=0.3, r1=0.45,
window.size = 25000)
kpPlotDensity(kp, data=ol_ranges_list$hdrr, col = "#F3B61F",
r0=0.45, r1=0.6,
window.size = 25000)
# Add exon density to ideogram
kpPlotDensity(kp, data=ex_ranges, col = "#f77cb5",
data.panel = "ideogram",
window.size = 25000,
r0 = 0.5, r1 = 1)
kpPlotDensity(kp, data=ex_ranges, col = "#f77cb5",
data.panel = "ideogram",
window.size = 25000,
r0 = 0.5, r1 = 0)
# Add labels
kpAddLabels(kp, labels="MIKK",
r0=0, r1=0.15,
cex = 0.4)
kpAddLabels(kp, labels="HNI",
r0=0.15, r1=0.3,
cex = 0.4)
kpAddLabels(kp, labels="HSOK",
r0=0.3, r1=0.45,
cex = 0.4)
kpAddLabels(kp, labels="HdrR",
r0=0.45, r1=0.6,
cex = 0.4)
kpAddLabels(kp, labels=bquote(italic(f[d])),
r0=0.6, r1=1,
label.margin = 0.035,
cex = 0.6)
dev.off()
20201027
# Split samples by chr
for i in $( echo 131_2 139_4 ) ; do
input_file=maf/20201014_indiv_split_with_ab/$i.txt ;
new_path=maf/20201014_indiv_split_with_ab_split-by-chr/$i ;
if [ ! -d "$new_path" ] ; then
mkdir $new_path ;
fi ;
for j in $(cut -f2 $input_file | sort | uniq ) ; do
awk "\$2 == $j" $input_file \
> $new_path/$j.txt ;
done ;
done
# Combine samples
for i in $( seq 1 24 ) ; do
chr=$i ;
in_dir=maf/20201014_indiv_split_with_ab_split-by-chr ;
out_dir=maf/20201017_ont_samples_split-by-chr ;
script=mikk_genome/code/scripts/20201017_combine_ont_sample_mafs.R ;
bsub \
-M 30000 \
-o log/20201027_combine-ont_$chr.out \
-e log/20201027_combine-ont_$chr.err \
"Rscript --vanilla $script $chr $in_dir $out_dir ";
done
# Combine MIKK total, ONT, and subpop MAF tables
for i in $( seq 1 24 ) ; do
script=mikk_genome/code/scripts/20201017_combine_final_and_af_data.R ;
chr=$i ;
in_file_mik=maf/20201015_split-by-chr_all/$chr.txt ;
in_file_sub=maf/20201017_subpops_split-by-chr_all/$chr.txt ;
in_file_ont=maf/20201017_ont_samples_split-by-chr/$chr.txt ;
in_file_emf=emfs/final_with_rc/$chr.txt ;
out_dir=maf/20201019_all ;
bsub \
-M 50000 \
-o log/20201027_frq_tbl_$i.out \
-e log/20201027_frq_tbl_$i.err \
"Rscript --vanilla $script $chr $in_file_mik $in_file_sub $in_file_ont $in_file_emf $out_dir " ;
done
Combine chr files into single file - run directly on cluster
library(tidyverse)
files <- list.files("maf/20201019_all",
pattern = "[0-9]",
full.names = T)
# read in data
dat_list <- lapply(files, function(x){
# read in data
df <- read.table(x,
header = T,
sep = "\t",
as.is = T,
check.names = F)
return(df)
})
# bind together and sort
final_df <- dplyr::bind_rows(dat_list) %>% # bind into single DF
dplyr::arrange(chr, pos) # sort by chromosome, then position
# write table
write.table(final_df, "abba_baba_mikk/20201020_freq_tables/all.txt", quote = F, sep = "\t", row.names = F)
for i in $( echo melastigma javanicus ); do
for j in $( echo hdrr hni hsok ); do
for k in $( echo 131_2 139_4 ) ; do
# don't run if they are the same
if [ $j != $k ] ; then
script=mikk_genome/code/scripts/20201020_run_abbababa.R ;
p1=$i ;
p2=$j ;
p3=$k ;
dat_file=$( echo abba_baba_mikk/20201020_freq_tables/all.txt ) ;
dir_out=$( echo abba_baba_mikk/20201020_rlists ) ;
bsub \
-M 30000 \
-o log/20201027_abbababa_$p1\_$p2\_$p3.out \
-e log/20201027_abbababa_$p1\_$p2\_$p3.err \
"Rscript --vanilla $script $p1 $p2 $p3 $dat_file $dir_out "
fi ;
done ;
done ;
done
library(tidyverse)
# get list of target files
files <- list.files("abba_baba_mikk/20201020_rlists",
full.names = T)
# read into one list
final_lst = lapply(files, readRDS)
# Reduce down to DF
final_df <- lapply(final_lst, function(file){
# Get D stats
g_wide_d <- file[["Genome-wide D"]][["D statistic"]]
per_chr_d <- sapply(file[["Per-chromosome"]],
function(x) x[["D statistic"]])
# Get Z scores
g_wide_z <- file[["Genome-wide D"]][["Z score"]]
per_chr_z <- sapply(file[["Per-chromosome"]],
function(x) x[["Z score"]])
# Get admixture
admix_f <- file[["Admixture"]][["f statistic"]]
per_chr_f <- sapply(file[["Per-chromosome"]],
function(x) x[["Admixture"]][["f statistic"]])
# Get confidence intervals
g_wide_ci_lower <- file[["Admixture"]][["Confidence interval"]][["lower"]]
g_wide_ci_upper <- file[["Admixture"]][["Confidence interval"]][["upper"]]
per_chr_ci_lower <- sapply(file[["Per-chromosome"]],
function(x) x[["Admixture"]][["Confidence interval"]][["lower"]])
per_chr_ci_upper <- sapply(file[["Per-chromosome"]],
function(x) x[["Admixture"]][["Confidence interval"]][["upper"]])
# Create data frame
df_out <- data.frame("p1" = file[["Populations"]][["P1"]],
"p2" = file[["Populations"]][["P2"]],
"p3" = file[["Populations"]][["P3"]],
"chr" = c("all", names(per_chr_d)),
"d_stat" = c(g_wide_d, per_chr_d),
"z_score" = c(g_wide_z, per_chr_z),
"admix_f" = c(admix_f, per_chr_f),
"f_ci_lower" = c(g_wide_ci_lower, per_chr_ci_lower),
"f_ci_upper" = c(g_wide_ci_upper, per_chr_ci_upper),
stringsAsFactors = F)
return(df_out)
}) %>%
dplyr::bind_rows()
# capitalise species / line names / colnames
rename_vec = c("HdrR", "HNI", "HO5", "HSOK", "iCab", "KW", "MIKK")
names(rename_vec) = c("hdrr", "hni", "ho5", "hsok", "icab", "kiyosu_wild", "mikk")
final_df = final_df %>%
dplyr::mutate(dplyr::across(c("p2", "p3"),
~dplyr::recode(.x, !!!rename_vec)))
colnames(final_df) = dplyr::recode(colnames(final_df),
p1 = "P1",
p2 = "P2",
p3 = "P3")
# save
write.table(final_df, "mikk_genome/data/20201020_abbababa_final.txt",
quote = F, sep = "\t", row.names = F)
# Read in data
final_df <- read.table(here("data", "20201020_abbababa_final.txt"),
header = T,
sep = "\t",
as.is = T)
# Set colours
cols <- c("#F3B61F", "#631E68", "#F6673A", "#F33A56", "#55B6B0", "#08605F", "#002642", "#B02156")
names(cols) <- c("HdrR", "HSOK", "HNI", "melastigma", "javanicus", "KW", "HO5", "iCab")
# Factorise chr, P1 and P2 to get them in the right order
chr_order <- c(seq(1,24), "all")
fish_order <- c("MIKK", "KW", "HO5", "iCab", "HdrR", "HSOK", "HNI")
final_df <- final_df %>%
dplyr::mutate(across(P2,
~factor(.x, levels = fish_order))) %>%
dplyr::mutate(chr = factor(chr, levels = chr_order))
NOTE: There are 142 rows with an admix_f value over 1?
final_df %>%
dplyr::filter(admix_f > 1) %>%
dplyr::count(P2, P3)
melastigma and javanicuscor_df <- final_df %>%
# filter for when P1 is another Oryzias, and P2
dplyr::filter(P1 %in% c("javanicus", "melastigma") & P2 != "MIKK" & P3 == "MIKK") %>%
# pivot to put the admixture_f stat for melastigma and javanicus in the same row
tidyr::pivot_wider(id_cols = c("P2", "chr"),
names_from = P1,
values_from = c(admix_f, f_ci_lower, f_ci_upper))
cor_df$chr <- as.character(cor_df$chr)
cor_df$chr <- ifelse(cor_df$chr == "all", "genome-wide", cor_df$chr)
chr_order_plot <- c(seq(1,24), "genome-wide")
cor_df$chr <- factor(cor_df$chr, levels = chr_order_plot)
cor_df_means <- cor_df %>%
# apply across rows
dplyr::rowwise() %>%
# get means for f and CIs
dplyr::mutate(mean_f = mean(c(admix_f_javanicus, admix_f_melastigma)),
mean_ci_upper = mean(c(f_ci_upper_javanicus, f_ci_upper_melastigma)),
mean_ci_lower = mean(c(f_ci_lower_javanicus, f_ci_lower_melastigma))) %>%
# set stats at a maximum of 1
dplyr::mutate(across(c("mean_f", "mean_ci_upper", "mean_ci_lower"),
~dplyr::if_else(.x > 1,
1,
.x)))
cor_df_means %>%
ggplot(aes(P2, mean_f, fill = P2)) +
geom_col() +
geom_errorbar(aes(ymin = mean_ci_lower,
ymax = mean_ci_upper),
position = position_dodge(0.9),
width = 0.25) +
guides(fill = F) +
facet_wrap(~chr) +
ylim(0,1) +
ylab(expression(paste("Mean ", italic("f"), " statistic"))) +
theme_bw(base_size = 8) +
theme(panel.grid = element_blank()) +
scale_fill_manual(values = cols)
ggsave(filename = paste("20201020_f_stat_by_chr_w-KW-icab-ho5", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200921_introgression/",
width = 24.75,
height = 19.5,
units = "cm",
dpi = 500)
cor_df_means %>%
dplyr::filter(P2 %in% c("HSOK", "HNI", "HdrR")) %>%
ggplot(aes(P2, mean_f, fill = P2)) +
geom_col() +
geom_errorbar(aes(ymin = mean_ci_lower,
ymax = mean_ci_upper),
position = position_dodge(0.9),
width = 0.25) +
guides(fill = F) +
facet_wrap(~chr) +
ylim(0,1) +
ylab(expression(paste("Mean ", italic("f"), " statistic"))) +
theme_bw() +
theme(panel.grid = element_blank()) +
scale_fill_manual(values = cols)
ggsave(filename = paste("20201020_f_stat_by_chr", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200921_introgression/",
width = 24.75,
height = 19.5,
units = "cm",
dpi = 500)
lapply(c("131_2", "139_4"), function(x){
# create DF
cor_df <- final_df %>%
# filter for when P1 is another Oryzias, and P2
dplyr::filter(P1 %in% c("javanicus", "melastigma") & P2 %in% c("HdrR", "HNI", "HSOK") & P3 == x) %>%
# pivot to put the admixture_f stat for melastigma and javanicus in the same row
tidyr::pivot_wider(id_cols = c("P2", "chr"),
names_from = P1,
values_from = c(admix_f, f_ci_lower, f_ci_upper)) %>%
# apply across rows
dplyr::rowwise() %>%
# get means for f and CIs
dplyr::mutate(mean_f = mean(c(admix_f_javanicus, admix_f_melastigma)),
mean_ci_upper = mean(c(f_ci_upper_javanicus, f_ci_upper_melastigma)),
mean_ci_lower = mean(c(f_ci_lower_javanicus, f_ci_lower_melastigma)))
# set stats at a maximum of 1
# dplyr::mutate(across(c("mean_f", "mean_ci_upper", "mean_ci_lower"),
# ~dplyr::if_else(.x > 1,
# 1,
# .x)))
# Create plot
out_plot <- cor_df %>%
ggplot(aes(P2, mean_f, fill = P2)) +
geom_col() +
geom_errorbar(aes(ymin = mean_ci_lower,
ymax = mean_ci_upper),
position = position_dodge(0.9),
width = 0.25) +
guides(fill = F) +
facet_wrap(~chr) +
#ylim(0,1) +
ylab(expression(paste("Mean ", italic("f"), " statistic"))) +
theme_bw() +
theme(panel.grid = element_blank()) +
scale_fill_manual(values = cols) +
ggtitle(x)
# Save
ggsave(filename = paste("20201027_", x, ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200921_introgression/20201020_ont_samples/",
width = 24.75,
height = 19.5,
units = "cm",
dpi = 500)
})
end 20201027
ftp://ftp.ensembl.org/pub/release-101/emf/ensembl-compara/multiple_alignments/50_fish.epo/
# Set up directory
mkdir emfs/ensembl_101
mkdir emfs/ensembl_101/raw
# Download raw files
wget -P emfs/ensembl_101/raw ftp://ftp.ensembl.org/pub/release-101/emf/ensembl-compara/multiple_alignments/50_fish.epo/*
# unzip into new directory
mkdir emfs/ensembl_101/unzipped
for i in $( find emfs/ensembl_101/raw/50_fish.epo.[0-9]* ) ; do
name=$( basename $i | cut -f3,4 -d'.') ;
out_dir=emfs/ensembl_101/unzipped ;
bsub \
-o log/20201026_unzip_$name.out \
-e log/20201026_unzip_$name.err \
"zcat $i > $out_dir/$name" ;
done
# Extract segments
mkdir emfs/ensembl_101/segmented
for i in $( find emfs/ensembl_101/unzipped/* ); do
script=./mikk_genome/code/scripts/20201026_extract-emf-segments.sh ;
out_dir=emfs/ensembl_101/segmented ;
# get basename
bname=$(basename $i) ;
bname_short=$(echo ${bname::-4} ) ;
# get chromosome
chr=$(echo $bname | cut -f1 -d"_" ) ;
# make directory for each EMF file
new_path=$out_dir/$bname_short ;
if [ ! -d "$new_path" ] ; then
mkdir $new_path ;
fi
# get segment count
segment_count=$( grep "^DATA" $i | wc -l ) ;
# get segment start and end for each file
for j in $( seq 1 $segment_count ) ; do
bsub \
-e log/20201026_segment_$bname_short\_$j.err \
-o log/20201026_segment_$bname_short\_$j.out \
"$script $i $j $new_path "
done;
done
# How many segments?
find emfs/ensembl_101/segmented/*/*data.txt
# 8,961
## Here the 6_2 file is corrupted! Contains 195 segments.
## Copy from 47_fish.epo from Ensembl 100:
# First segment: 6_18083133_18110990_1
# Last segment: 6_32112995_32194088_1
rm emfs/ensembl_101/segmented/6_2/* # remove from the new folder
# copy from the old one
for i in $( find emfs/segmented/6*/* ) ; do
bname=$( basename $i ) ;
start=$( echo $bname | cut -d"_" -f2 ) ;
end=$( echo $bname | cut -d"_" -f3 ) ;
if (( $start >= 18083133 && $end <= 32194088 )) ; then
cp $i emfs/ensembl_101/segmented/6_2/ ;
fi ;
done
# How many segments?
find emfs/ensembl_101/segmented/*/*.data.txt | wc -l
# 8974
# Clean
mkdir emfs/ensembl_101/cleaned
for i in $( find emfs/ensembl_101/segmented/*/*.data.txt ); do
# make new directory
chr_file=$( echo $i | cut -f4 -d"/" );
new_path=$( echo emfs/ensembl_101/cleaned/$chr_file );
if [ ! -d "$new_path" ]; then
mkdir $new_path;
fi
# create name for new file
bname=$( echo $i | sed 's/.data.txt//g' );
bname_short=$( basename $bname );
script=mikk_genome/code/scripts/20201026_add-hdrr-coords-to-emf-data.R ;
bsub \
-M 10000 \
-o log/20201029_clean_$chr_file\_$bname_short.out \
-e log/20201029_clean_$chr_file\_$bname_short.err \
"Rscript --vanilla $script $bname $new_path";
done
# 115 segments had no other target Oryzias
# Now only 8859
# Consolidate each sub-chr
mkdir emfs/ensembl_101/consolidated
for i in $(find emfs/ensembl_101/cleaned/* -type d ); do
dir_in=$i ;
script=mikk_genome/code/scripts/20200910_consolidate_chr_dat.R ;
# get chromosome
chr=$(echo $i | cut -f4 -d"/" );
out_file=emfs/ensembl_101/consolidated/$chr.txt ;
# consolidate into one file
bsub \
-M 20000 \
-o log/20201030_consol_$chr.out \
-e log/20201030_consol_$chr.err \
"Rscript --vanilla $script $dir_in $out_file ";
done
# combine for each chr
mkdir emfs/ensembl_101/final
for i in $(find emfs/ensembl_101/consolidated/* | cut -f4 -d"/" | cut -f1 -d"_" | sort | uniq ); do
chr=$i ;
script=mikk_genome/code/scripts/20200911_consolidate_chr_files.R ;
in_dir=emfs/ensembl_101/consolidated ;
out_dir=emfs/ensembl_101/final ;
bsub \
-M 50000 \
-o log/20201030_full_chr_$i.out \
-e log/20201030_full_chr_$i.err \
"Rscript --vanilla $script $chr $in_dir $out_dir ";
done
# Combine MIKK total, ONT, and subpop MAF tables
mkdir maf/20201030_all
for i in $( seq 1 24 ) ; do
script=mikk_genome/code/scripts/20201017_combine_final_and_af_data.R ;
chr=$i ;
in_file_mik=maf/20201015_split-by-chr_all/$chr.txt ;
in_file_sub=maf/20201017_subpops_split-by-chr_all/$chr.txt ;
in_file_ont=maf/20201017_ont_samples_split-by-chr/$chr.txt ;
in_file_emf=emfs/ensembl_101/final/$chr.txt ;
out_dir=maf/20201030_all ;
bsub \
-M 50000 \
-o log/20201030_frq_tbl_$i.out \
-e log/20201030_frq_tbl_$i.err \
"Rscript --vanilla $script $chr $in_file_mik $in_file_sub $in_file_ont $in_file_emf $out_dir " ;
done
Combine chr files into single file - run directly on cluster
library(tidyverse)
files <- list.files("maf/20201019_all",
pattern = "[0-9]",
full.names = T)
# read in data
dat_list <- lapply(files, function(x){
# read in data
df <- read.table(x,
header = T,
sep = "\t",
as.is = T,
check.names = F)
return(df)
})
# bind together and sort
final_df <- dplyr::bind_rows(dat_list) %>% # bind into single DF
dplyr::arrange(chr, pos) # sort by chromosome, then position
# write table
write.table(final_df, "abba_baba_mikk/20201020_freq_tables/all.txt", quote = F, sep = "\t", row.names = F)
20201106
mkdir stats/20201106_indiv_stats
for sample in $( bcftools query -l vcfs/full-run_line-ids.vcf.gz ); do
in_file=vcfs/full-run_line-ids.vcf.gz ;
out_file=stats/20201106_indiv_stats/$sample.stats ;
bsub \
-M 20000 \
-o log/20201106_singleton_stats_$sample.out \
-e log/20201106_singleton_stats_$sample.err \
"bcftools view \
--output-type v \
--samples $sample \
$in_file | \
bcftools stats > $out_file " ;
done
# Doesn't work because every SNP will be a singleton if there's only 1 sample in the VCF.
# Full run
vcftools \
--gzvcf vcfs/full-run_line-ids.vcf.gz \
--out stats/20201106_full-run_line-ids \
--singletons
# Non-missing, biallelic SNPs
vcftools \
--gzvcf vcfs/panel_no-sibs_line-ids_no-missing_with-maf_bi-snps.vcf.gz \
--out stats/20201106_panel_no-sibs_line-ids_no-missing_with-maf_bi-snps \
--singletons
# Creates giant files
# Try again with bcftools stats
## Create file with all samples in it
bcftools query -l vcfs/full-run_line-ids.vcf.gz \
> mikk_genome/data/20201106_full-run_line-ids_samples.txt
## Run states using this as the samples file
bcftools stats \
--samples-file mikk_genome/data/20201106_full-run_line-ids_samples.txt \
vcfs/full-run_line-ids.vcf.gz \
> stats/full-run_line-ids_per-sample.stats
## Extract relevant lines
grep "PSC" stats/full-run_line-ids_per-sample.stats \
> mikk_genome/data/20201106_full-run_line-ids_singleton_counts.txt
# Do the same thing per-chromosome
mkdir stats/full-run_line-ids_per-sample_per-chr
for chr in $( seq 1 24 ); do
in_file=vcfs/full-run_line-ids.vcf.gz ;
out_file=stats/full-run_line-ids_per-sample_per-chr/$chr.stats ;
bsub \
-M 20000 \
-o log/20201106_singleton_perchr_$chr.out \
-e log/20201106_singleton_perchr_$chr.err \
"bcftools view \
--output-type v \
--regions $chr \
$in_file | \
bcftools stats \
--samples-file mikk_genome/data/20201106_full-run_line-ids_samples.txt \
> $out_file " ;
done
# Extract relevant lines
mkdir stats/full-run_line-ids_per-sample_per-chr_psc
for chr in $( seq 1 24 ) ; do
grep "PSC" stats/full-run_line-ids_per-sample_per-chr/$chr.stats \
> stats/full-run_line-ids_per-sample_per-chr_psc/$chr.stats ;
done
Consolidate per-chr data on cluster
# List files
files = list.files("stats/full-run_line-ids_per-sample_per-chr_psc",
full.names = T)
# Read into list
df_list = lapply(files, function(x){
# Read df
df = read.table(x,
skip = 1,
sep = "\t",
comment.char = ",",
header = T,
check.names = F)
# fix column names
colnames(df) = gsub("\\[.+\\]|#|\ ", "", colnames(df))
# return df
return(df)
})
# Set names (chromsome)
names(df_list) = gsub(".stats", "", basename(files))
# Bind into single DF
df = df_list %>%
dplyr::bind_rows(.id = "chr")
# Write table
write.table(df, "mikk_genome/data/20201107_singletons_per_chr.txt",
quote = F,
sep = "\t",
row.names = F)
ggsave(filename = paste("20201107_singleton_counts_per_chr", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200602_mikk_genome/",
width = 24,
height = 20,
units = "cm",
dpi = 500)
ggsave(filename = paste("20201106_singleton_counts", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200602_mikk_genome/",
width = 24,
height = 20,
units = "cm",
dpi = 500)
bcftools index \
--stats \
vcfs/panel_no-sibs_line-ids_no-missing_bi-snps_with-af.vcf.gz \
> mikk_genome/data/20201106_non-missing_bi-snp_count.txt
ggsave(filename = paste("20201106_snps_per_mb", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200602_mikk_genome/",
width = 24.125,
height = 12.5,
units = "cm",
dpi = 500)
ggsave(filename = paste("20201106_snps-per-mb_v_exon-props", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200602_mikk_genome/",
width = 24,
height = 20,
units = "cm",
dpi = 500)
# Calculate correlation
cor.test(chr_df$snps_per_megabase, chr_df$prop_cov_exon, method = "spearman")
Spearman's rank correlation rho
data: chr_df$snps_per_megabase and chr_df$prop_cov_exon
S = 3274, p-value = 0.04033
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.4234783
chr_df %>%
ggplot(aes(snps_per_megabase, prop_cov_exon, colour = chr, label = chr)) +
geom_point() +
geom_text(hjust = -0.5) +
theme_bw() +
guides(colour = F) +
xlab("Non-missing, biallelic SNPs per megabase") +
ylab("Proportion of chromosome covered by exons")
NA
NA
ggsave(filename = paste("20201106_snps-per-mb_v_exon-props_scatter", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka pics/20200602_mikk_genome/",
width = 21,
height = 20,
units = "cm",
dpi = 500)
# On cluster (in mikk_genome repo directory)
library(here)
library(tidyverse)
# Read in lines
lines = scan(here::here("data", "20201021_full-run_line-ids_samples.txt"),
what = character())
# Remove non-MIKK lines and duplicates
lines = lines[!grepl(c("iCab|KW|Ho5|-"), lines)]
# Find siblings
sibs = lines %>%
stringr::str_split("_", simplify = T) %>%
subset(select = 1)
# Create vector of sibling lines to keep
to_keep = duplicated(sibs) | duplicated(sibs, fromLast = T)
# Filter and sort
lines_out = sort(lines[to_keep])
# Write to file
write.table(lines_out, here("data", "20201107_sibling_lines.txt"),
quote = F,
row.names = F,
col.names = F)
bcftools view \
--samples-file mikk_genome/data/20201107_sibling_lines.txt \
--output-type z \
--output-file vcfs/mikk_sib-lines-only.vcf.gz \
vcfs/full-run_line-ids.vcf.gz
mkdir plink/20201107_mikk_sib-lines-only
# make BED
plink \
--vcf vcfs/mikk_sib-lines-only.vcf.gz \
--make-bed \
--double-id \
--snps-only \
--biallelic-only \
--chr-set 24 no-xy \
--chr 1-24 \
--out plink/20201107_mikk_sib-lines-only/20201107
# recode for 012 transposed
plink \
--bfile plink/20201107_mikk_sib-lines-only/20201107 \
--recode A-transpose \
--out plink/20201107_mikk_sib-lines-only/20201107_recode012
# compress
gzip -k plink/20201107_mikk_sib-lines-only/20201107_recode012.traw
gzip -k plink/20201107_mikk_sib-lines-only/20201107.bim
# send to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20201107_mikk_sib-lines-only/20201107_recode012.traw.gz ~/Documents/Data/20200707_mikk_ld/20201107_plink_full_set
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20201107_mikk_sib-lines-only/20201107.bim.gz ~/Documents/Data/20200707_mikk_ld/20201107_plink_full_set
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20201107_mikk_sib-lines-only/20201107.bed ~/Documents/Data/20200707_mikk_ld/20201107_plink_full_set
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20201107_mikk_sib-lines-only/20201107.fam ~/Documents/Data/20200707_mikk_ld/20201107_plink_full_set
# decompress
gunzip ~/Documents/Data/20200707_mikk_ld/20201107_plink_full_set/20201107_recode012.traw.gz
gunzip ~/Documents/Data/20200707_mikk_ld/20201107_plink_full_set/20201107.bim.gz
library(gaston)
# try with read.bed.matrix
mikk_full <- gaston::read.bed.matrix("~/Documents/Data/20200707_mikk_ld/20201107_plink_full_set/20201107",
rds = NULL)
mikk_geno <- readr::read_tsv(file = "~/Documents/Data/20200707_mikk_ld/20201107_plink_full_set/20201107_recode012.traw",
progress = T,
col_names = T)
── Column specification ─────────────────────────────────────────────────────────────────────────────────
cols(
.default = col_double(),
SNP = col_character(),
COUNTED = col_character(),
ALT = col_character()
)
ℹ Use `spec()` for the full column specifications.